Shounak Sural

CV
h-index26
7papers
28citations
Novelty46%
AI Score53

7 Papers

46.7ROMay 29
Safe2Drive: Evaluating Safe Driving Behaviors of E2E Autonomous Driving Models

Nishad Sahu, Kalpana Panda, Congyuan Yu et al.

Recent end-to-end (E2E) autonomous driving policies achieve high driving scores in closed-loop simulations. Yet it remains unclear whether these policies handle common safety-critical scenarios. We present Safe2Drive (S2D), a set of Bench2Drive-aligned scenario extensions focused on three frequent families of road hazards: work zones, pedestrian jaywalking, and occluded vulnerable road users (VRUs). Safe2Drive adds 100 common but challenging scenarios and introduces SafeDriving Score (SDS), a safety-centric metric that augments prior evaluators with pre-crash braking, work zone-object contact, lane centering, and smoothness checks. Evaluating two state-of-the-art policies (LEAD and SimLingo) on S2D, we find that their driving scores drop sharply relative to their reported Bench2Drive baselines (LEAD: from 94.70 DS on Bench2Drive to 39.95 DS on S2D; SimLingo: from 85.07 DS on Bench2Drive to 41.00 DS on S2D) and that SDS on S2D is low (11.85 for LEAD and 15.27 for Sim-Lingo). These results are consistent with brittle safe-driving behaviors such as poor work-zone understanding, red-light violations, and late or absent braking for pedestrians. This study highlights a lack of safe behavioral reasoning in E2E models even when tested on CARLA towns that are part of the training set. We plan to release the code and videos for all 100 S2D scenarios.

22.7CVMar 26Code
BEVMAPMATCH: Multimodal BEV Neural Map Matching for Robust Re-Localization of Autonomous Vehicles

Shounak Sural, Ragunathan Rajkumar

Localization in GNSS-denied and GNSS-degraded environments is a challenge for the safe widespread deployment of autonomous vehicles. Such GNSS-challenged environments require alternative methods for robust localization. In this work, we propose BEVMapMatch, a framework for robust vehicle re-localization on a known map without the need for GNSS priors. BEVMapMatch uses a context-aware lidar+camera fusion method to generate multimodal Bird's Eye View (BEV) segmentations around the ego vehicle in both good and adverse weather conditions. Leveraging a search mechanism based on cross-attention, the generated BEV segmentation maps are then used for the retrieval of candidate map patches for map-matching purposes. Finally, BEVMapMatch uses the top retrieved candidate for finer alignment against the generated BEV segmentation, achieving accurate global localization without the need for GNSS. Multiple frames of generated BEV segmentation further improve localization accuracy. Extensive evaluations show that BEVMapMatch outperforms existing methods for re-localization in GNSS-denied and adverse environments, with a Recall@1m of 39.8%, being nearly twice as much as the best performing re-localization baseline. Our code and data will be made available at https://github.com/ssuralcmu/BEVMapMatch.git.

CVApr 23, 2024Code
ContextualFusion: Context-Based Multi-Sensor Fusion for 3D Object Detection in Adverse Operating Conditions

Shounak Sural, Nishad Sahu, Ragunathan Rajkumar

The fusion of multimodal sensor data streams such as camera images and lidar point clouds plays an important role in the operation of autonomous vehicles (AVs). Robust perception across a range of adverse weather and lighting conditions is specifically required for AVs to be deployed widely. While multi-sensor fusion networks have been previously developed for perception in sunny and clear weather conditions, these methods show a significant degradation in performance under night-time and poor weather conditions. In this paper, we propose a simple yet effective technique called ContextualFusion to incorporate the domain knowledge about cameras and lidars behaving differently across lighting and weather variations into 3D object detection models. Specifically, we design a Gated Convolutional Fusion (GatedConv) approach for the fusion of sensor streams based on the operational context. To aid in our evaluation, we use the open-source simulator CARLA to create a multimodal adverse-condition dataset called AdverseOp3D to address the shortcomings of existing datasets being biased towards daytime and good-weather conditions. Our ContextualFusion approach yields an mAP improvement of 6.2% over state-of-the-art methods on our context-balanced synthetic dataset. Finally, our method enhances state-of-the-art 3D objection performance at night on the real-world NuScenes dataset with a significant mAP improvement of 11.7%.

CVAug 30, 2024
ContextVLM: Zero-Shot and Few-Shot Context Understanding for Autonomous Driving using Vision Language Models

Shounak Sural, Naren, Ragunathan Rajkumar

In recent years, there has been a notable increase in the development of autonomous vehicle (AV) technologies aimed at improving safety in transportation systems. While AVs have been deployed in the real-world to some extent, a full-scale deployment requires AVs to robustly navigate through challenges like heavy rain, snow, low lighting, construction zones and GPS signal loss in tunnels. To be able to handle these specific challenges, an AV must reliably recognize the physical attributes of the environment in which it operates. In this paper, we define context recognition as the task of accurately identifying environmental attributes for an AV to appropriately deal with them. Specifically, we define 24 environmental contexts capturing a variety of weather, lighting, traffic and road conditions that an AV must be aware of. Motivated by the need to recognize environmental contexts, we create a context recognition dataset called DrivingContexts with more than 1.6 million context-query pairs relevant for an AV. Since traditional supervised computer vision approaches do not scale well to a variety of contexts, we propose a framework called ContextVLM that uses vision-language models to detect contexts using zero- and few-shot approaches. ContextVLM is capable of reliably detecting relevant driving contexts with an accuracy of more than 95% on our dataset, while running in real-time on a 4GB Nvidia GeForce GTX 1050 Ti GPU on an AV with a latency of 10.5 ms per query.

63.6ROMay 16
DriveSafer: End-to-End Autonomous Driving with Safety Guidance

Shounak Sural, Raj Rajkumar

End-to-End (E2E) autonomous driving models have shown growing capability in recent years, with performance improving on increasingly challenging benchmarks. However, modern generative E2E planners still suffer from a substantial number of catastrophic failures in safety-critical scenarios. We find that many such failures arise from violations of physical constraints and safety requirements, leading to unsafe behavior. Motivated by this finding, in this paper, we focus on improving safety outcomes in generative end-to-end driving with a targeted reduction of catastrophic planning failures, instead of enhancing average planning quality. Towards this end, we propose DriveSafer, a failure-aware safety framework for end-to-end planners. DriveSafer explicitly steers generative planners towards safe behaviors leveraging both training-time safety constraints and inference-time safety guidance. Compared to the state-of-the-art DiffusionDrive model, on the NAVSIM benchmark, DriveSafer reduces the number of catastrophic failures (PDMS=0) by 48%, with over 65% reduction in drivable-area compliance failures.

CVOct 17, 2025
ObjectTransforms for Uncertainty Quantification and Reduction in Vision-Based Perception for Autonomous Vehicles

Nishad Sahu, Shounak Sural, Aditya Satish Patil et al.

Reliable perception is fundamental for safety critical decision making in autonomous driving. Yet, vision based object detector neural networks remain vulnerable to uncertainty arising from issues such as data bias and distributional shifts. In this paper, we introduce ObjectTransforms, a technique for quantifying and reducing uncertainty in vision based object detection through object specific transformations at both training and inference times. At training time, ObjectTransforms perform color space perturbations on individual objects, improving robustness to lighting and color variations. ObjectTransforms also uses diffusion models to generate realistic, diverse pedestrian instances. At inference time, object perturbations are applied to detected objects and the variance of detection scores are used to quantify predictive uncertainty in real time. This uncertainty signal is then used to filter out false positives and also recover false negatives, improving the overall precision recall curve. Experiments with YOLOv8 on the NuImages 10K dataset demonstrate that our method yields notable accuracy improvements and uncertainty reduction across all object classes during training, while predicting desirably higher uncertainty values for false positives as compared to true positives during inference. Our results highlight the potential of ObjectTransforms as a lightweight yet effective mechanism for reducing and quantifying uncertainty in vision-based perception during training and inference respectively.

ROOct 1, 2025
Physics-Informed Neural Controlled Differential Equations for Scalable Long Horizon Multi-Agent Motion Forecasting

Shounak Sural, Charles Kekeh, Wenliang Liu et al.

Long-horizon motion forecasting for multiple autonomous robots is challenging due to non-linear agent interactions, compounding prediction errors, and continuous-time evolution of dynamics. Learned dynamics of such a system can be useful in various applications such as travel time prediction, prediction-guided planning and generative simulation. In this work, we aim to develop an efficient trajectory forecasting model conditioned on multi-agent goals. Motivated by the recent success of physics-guided deep learning for partially known dynamical systems, we develop a model based on neural Controlled Differential Equations (CDEs) for long-horizon motion forecasting. Unlike discrete-time methods such as RNNs and transformers, neural CDEs operate in continuous time, allowing us to combine physics-informed constraints and biases to jointly model multi-robot dynamics. Our approach, named PINCoDE (Physics-Informed Neural Controlled Differential Equations), learns differential equation parameters that can be used to predict the trajectories of a multi-agent system starting from an initial condition. PINCoDE is conditioned on future goals and enforces physics constraints for robot motion over extended periods of time. We adopt a strategy that scales our model from 10 robots to 100 robots without the need for additional model parameters, while producing predictions with an average ADE below 0.5 m for a 1-minute horizon. Furthermore, progressive training with curriculum learning for our PINCoDE model results in a 2.7X reduction of forecasted pose error over 4 minute horizons compared to analytical models.