Ashish Bastola

CV
h-index30
11papers
54citations
Novelty43%
AI Score44

11 Papers

CVJan 26
Spatial-Conditioned Reasoning in Long-Egocentric Videos

James Tribble, Hao Wang, Si-En Hong et al.

Long-horizon egocentric video presents significant challenges for visual navigation due to viewpoint drift and the absence of persistent geometric context. Although recent vision-language models perform well on image and short-video reasoning, their spatial reasoning capability in long egocentric sequences remains limited. In this work, we study how explicit spatial signals influence VLM-based video understanding without modifying model architectures or inference procedures. We introduce Sanpo-D, a fine-grained re-annotation of the Google Sanpo dataset, and benchmark multiple VLMs on navigation-oriented spatial queries. To examine input-level inductive bias, we further fuse depth maps with RGB frames and evaluate their impact on spatial reasoning. Our results reveal a trade-off between general-purpose accuracy and spatial specialization, showing that depth-aware and spatially grounded representations can improve performance on safety-critical tasks such as pedestrian and obstruction detection.

CVJan 12
Motion Focus Recognition in Fast-Moving Egocentric Video

Daniel Hong, James Tribble, Hao Wang et al.

From Vision-Language-Action (VLA) systems to robotics, existing egocentric datasets primarily focus on action recognition tasks, while largely overlooking the inherent role of motion analysis in sports and other fast-movement scenarios. To bridge this gap, we propose a real-time motion focus recognition method that estimates the subject's locomotion intention from any egocentric video. Our approach leverages the foundation model for camera pose estimation and introduces system-level optimizations to enable efficient and scalable inference. Evaluated on a collected egocentric action dataset, our method achieves real-time performance with manageable memory consumption through a sliding batch inference strategy. This work makes motion-centric analysis practical for edge deployment and offers a complementary perspective to existing egocentric studies on sports and fast-movement activities.

AIJul 6, 2025Code
Anomalous Decision Discovery using Inverse Reinforcement Learning

Ashish Bastola, Mert D. Pesé, Long Cheng et al.

Anomaly detection plays a critical role in Autonomous Vehicles (AVs) by identifying unusual behaviors through perception systems that could compromise safety and lead to hazardous situations. Current approaches, which often rely on predefined thresholds or supervised learning paradigms, exhibit reduced efficacy when confronted with unseen scenarios, sensor noise, and occlusions, leading to potential safety-critical failures. Moreover, supervised methods require large annotated datasets, limiting their real-world feasibility. To address these gaps, we propose an anomaly detection framework based on Inverse Reinforcement Learning (IRL) to infer latent driving intentions from sequential perception data, thus enabling robust identification. Specifically, we present Trajectory-Reward Guided Adaptive Pre-training (TRAP), a novel IRL framework for anomaly detection, to address two critical limitations of existing methods: noise robustness and generalization to unseen scenarios. Our core innovation is implicitly learning temporal credit assignments via reward and worst-case supervision. We leverage pre-training with variable-horizon sampling to maximize time-to-consequence, resulting in early detection of behavior deviation. Experiments on 14,000+ simulated trajectories demonstrate state-of-the-art performance, achieving 0.90 AUC and 82.2\% F1-score - outperforming similarly trained supervised and unsupervised baselines by 39\% on Recall and 12\% on F1-score, respectively. Similar performance is achieved while exhibiting robustness to various noise types and generalization to unseen anomaly types. Our code will be available at: https://github.com/abastola0/TRAP.git

CVMar 6, 2024
FLAME Diffuser: Wildfire Image Synthesis using Mask Guided Diffusion

Hao Wang, Sayed Pedram Haeri Boroujeni, Xiwen Chen et al.

Wildfires are a significant threat to ecosystems and human infrastructure, leading to widespread destruction and environmental degradation. Recent advancements in deep learning and generative models have enabled new methods for wildfire detection and monitoring. However, the scarcity of annotated wildfire images limits the development of robust models for these tasks. In this work, we present the FLAME Diffuser, a training-free, diffusion-based framework designed to generate realistic wildfire images with paired ground truth. Our framework uses augmented masks, sampled from real wildfire data, and applies Perlin noise to guide the generation of realistic flames. By controlling the placement of these elements within the image, we ensure precise integration while maintaining the original images style. We evaluate the generated images using normalized Frechet Inception Distance, CLIP Score, and a custom CLIP Confidence metric, demonstrating the high quality and realism of the synthesized wildfire images. Specifically, the fusion of Perlin noise in this work significantly improved the quality of synthesized images. The proposed method is particularly valuable for enhancing datasets used in downstream tasks such as wildfire detection and monitoring.

CVApr 25, 2024
Motor Focus: Fast Ego-Motion Prediction for Assistive Visual Navigation

Hao Wang, Jiayou Qin, Xiwen Chen et al.

Assistive visual navigation systems for visually impaired individuals have become increasingly popular thanks to the rise of mobile computing. Most of these devices work by translating visual information into voice commands. In complex scenarios where multiple objects are present, it is imperative to prioritize object detection and provide immediate notifications for key entities in specific directions. This brings the need for identifying the observer's motion direction (ego-motion) by merely processing visual information, which is the key contribution of this paper. Specifically, we introduce Motor Focus, a lightweight image-based framework that predicts the ego-motion - the humans (and humanoid machines) movement intentions based on their visual feeds, while filtering out camera motion without any camera calibration. To this end, we implement an optical flow-based pixel-wise temporal analysis method to compensate for the camera motion with a Gaussian aggregation to smooth out the movement prediction area. Subsequently, to evaluate the performance, we collect a dataset including 50 clips of pedestrian scenes in 5 different scenarios. We tested this framework with classical feature detectors such as SIFT and ORB to show the comparison. Our framework demonstrates its superiority in speed (> 40FPS), accuracy (MAE = 60pixels), and robustness (SNR = 23dB), confirming its potential to enhance the usability of vision-based assistive navigation tools in complex environments.

CVJan 1, 2025
Diffusion Prism: Enhancing Diversity and Morphology Consistency in Mask-to-Image Diffusion

Hao Wang, Xiwen Chen, Ashish Bastola et al.

The emergence of generative AI and controllable diffusion has made image-to-image synthesis increasingly practical and efficient. However, when input images exhibit low entropy and sparse, the inherent characteristics of diffusion models often result in limited diversity. This constraint significantly interferes with data augmentation. To address this, we propose Diffusion Prism, a training-free framework that efficiently transforms binary masks into realistic and diverse samples while preserving morphological features. We explored that a small amount of artificial noise will significantly assist the image-denoising process. To prove this novel mask-to-image concept, we use nano-dendritic patterns as an example to demonstrate the merit of our method compared to existing controllable diffusion models. Furthermore, we extend the proposed framework to other biological patterns, highlighting its potential applications across various fields.

CVNov 20, 2024
RobustFormer: Noise-Robust Pre-training for images and videos

Ashish Bastola, Nishant Luitel, Hao Wang et al.

While deep learning models are powerful tools that revolutionized many areas, they are also vulnerable to noise as they rely heavily on learning patterns and features from the exact details of the clean data. Transformers, which have become the backbone of modern vision models, are no exception. Current Discrete Wavelet Transforms (DWT) based methods do not benefit from masked autoencoder (MAE) pre-training since the inverse DWT (iDWT) introduced in these approaches is computationally inefficient and lacks compatibility with video inputs in transformer architectures. In this work, we present RobustFormer, a method that overcomes these limitations by enabling noise-robust pre-training for both images and videos; improving the efficiency of DWT-based methods by removing the need for computationally iDWT steps and simplifying the attention mechanism. To our knowledge, the proposed method is the first DWT-based method compatible with video inputs and masked pre-training. Our experiments show that MAE-based pre-training allows us to bypass the iDWT step, greatly reducing computation. Through extensive tests on benchmark datasets, RobustFormer achieves state-of-the-art results for both image and video tasks.

CVDec 14, 2025
Fast 2DGS: Efficient Image Representation with Deep Gaussian Prior

Hao Wang, Ashish Bastola, Chaoyi Zhou et al.

As generative models become increasingly capable of producing high-fidelity visual content, the demand for efficient, interpretable, and editable image representations has grown substantially. Recent advances in 2D Gaussian Splatting (2DGS) have emerged as a promising solution, offering explicit control, high interpretability, and real-time rendering capabilities (>1000 FPS). However, high-quality 2DGS typically requires post-optimization. Existing methods adopt random or heuristics (e.g., gradient maps), which are often insensitive to image complexity and lead to slow convergence (>10s). More recent approaches introduce learnable networks to predict initial Gaussian configurations, but at the cost of increased computational and architectural complexity. To bridge this gap, we present Fast-2DGS, a lightweight framework for efficient Gaussian image representation. Specifically, we introduce Deep Gaussian Prior, implemented as a conditional network to capture the spatial distribution of Gaussian primitives under different complexities. In addition, we propose an attribute regression network to predict dense Gaussian properties. Experiments demonstrate that this disentangled architecture achieves high-quality reconstruction in a single forward pass, followed by minimal fine-tuning. More importantly, our approach significantly reduces computational cost without compromising visual quality, bringing 2DGS closer to industry-ready deployment.

MAJan 28, 2025
Anomaly Detection in Cooperative Vehicle Perception Systems under Imperfect Communication

Ashish Bastola, Hao Wang, Abolfazl Razi

Anomaly detection is a critical requirement for ensuring safety in autonomous driving. In this work, we leverage Cooperative Perception to share information across nearby vehicles, enabling more accurate identification and consensus of anomalous behaviors in complex traffic scenarios. To account for the real-world challenge of imperfect communication, we propose a cooperative-perception-based anomaly detection framework (CPAD), which is a robust architecture that remains effective under communication interruptions, thereby facilitating reliable performance even in low-bandwidth settings. Since no multi-agent anomaly detection dataset exists for vehicle trajectories, we introduce 15,000 different scenarios with a 90,000 trajectories benchmark dataset generated through rule-based vehicle dynamics analysis. Empirical results demonstrate that our approach outperforms standard anomaly classification methods in F1-score, AUC and showcase strong robustness to agent connection interruptions.

CVMar 19, 2024
VisionGPT: LLM-Assisted Real-Time Anomaly Detection for Safe Visual Navigation

Hao Wang, Jiayou Qin, Ashish Bastola et al.

This paper explores the potential of Large Language Models(LLMs) in zero-shot anomaly detection for safe visual navigation. With the assistance of the state-of-the-art real-time open-world object detection model Yolo-World and specialized prompts, the proposed framework can identify anomalies within camera-captured frames that include any possible obstacles, then generate concise, audio-delivered descriptions emphasizing abnormalities, assist in safe visual navigation in complex circumstances. Moreover, our proposed framework leverages the advantages of LLMs and the open-vocabulary object detection model to achieve the dynamic scenario switch, which allows users to transition smoothly from scene to scene, which addresses the limitation of traditional visual navigation. Furthermore, this paper explored the performance contribution of different prompt components, provided the vision for future improvement in visual accessibility, and paved the way for LLMs in video anomaly detection and vision-language understanding.

HCJan 26, 2024
Driving Towards Inclusion: A Systematic Review of AI-powered Accessibility Enhancements for People with Disability in Autonomous Vehicles

Ashish Bastola, Hao Wang, Sayed Pedram Haeri Boroujeni et al.

This paper provides a comprehensive and, to our knowledge, the first review of inclusive human-computer interaction (HCI) within autonomous vehicles (AVs) and human-driven cars with partial autonomy, emphasizing accessibility and user-centered design principles. We explore the current technologies and HCI systems designed to enhance passenger experience, particularly for individuals with accessibility needs. Key technologies discussed include brain-computer interfaces, anthropomorphic interaction, virtual reality, augmented reality, mode adaptation, voice-activated interfaces, haptic feedback, etc. Each technology is evaluated for its role in creating an inclusive in-vehicle environment. Furthermore, we highlight recent interface designs by leading companies and review emerging concepts and prototypes under development or testing, which show significant potential to address diverse accessibility requirements. Safety considerations, ethical concerns, and adoption of AVs are other major issues that require thorough investigation. Building on these findings, we propose an end-to-end design framework that addresses accessibility requirements across diverse user demographics, including older adults and individuals with physical or cognitive impairments. This work provides actionable insights for designers, researchers, and policymakers aiming to create safer and more comfortable environments in autonomous and regular vehicles accessible to all users.