Ravishankar K. Iyer

DC
h-index20
14papers
729citations
Novelty48%
AI Score41

14 Papers

ROJun 2, 2022
Watch Out for the Safety-Threatening Actors: Proactively Mitigating Safety Hazards

Saurabh Jha, Shengkun Cui, Zbigniew Kalbarczyk et al.

Despite the successful demonstration of autonomous vehicles (AVs), such as self-driving cars, ensuring AV safety remains a challenging task. Although some actors influence an AV's driving decisions more than others, current approaches pay equal attention to each actor on the road. An actor's influence on the AV's decision can be characterized in terms of its ability to decrease the number of safe navigational choices for the AV. In this work, we propose a safety threat indicator (STI) using counterfactual reasoning to estimate the importance of each actor on the road with respect to its influence on the AV's safety. We use this indicator to (i) characterize the existing real-world datasets to identify rare hazardous scenarios as well as the poor performance of existing controllers in such scenarios; and (ii) design an RL based safety mitigation controller to proactively mitigate the safety hazards those actors pose to the AV. Our approach reduces the accident rate for the state-of-the-art AV agent(s) in rare hazardous scenarios by more than 70%.

QMOct 2, 2023
REMEDI: REinforcement learning-driven adaptive MEtabolism modeling of primary sclerosing cholangitis DIsease progression

Chang Hu, Krishnakant V. Saboo, Ahmad H. Ali et al.

Primary sclerosing cholangitis (PSC) is a rare disease wherein altered bile acid metabolism contributes to sustained liver injury. This paper introduces REMEDI, a framework that captures bile acid dynamics and the body's adaptive response during PSC progression that can assist in exploring treatments. REMEDI merges a differential equation (DE)-based mechanistic model that describes bile acid metabolism with reinforcement learning (RL) to emulate the body's adaptations to PSC continuously. An objective of adaptation is to maintain homeostasis by regulating enzymes involved in bile acid metabolism. These enzymes correspond to the parameters of the DEs. REMEDI leverages RL to approximate adaptations in PSC, treating homeostasis as a reward signal and the adjustment of the DE parameters as the corresponding actions. On real-world data, REMEDI generated bile acid dynamics and parameter adjustments consistent with published findings. Also, our results support discussions in the literature that early administration of drugs that suppress bile acid synthesis may be effective in PSC treatment.

DCApr 12, 2024Code
Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction

Haoran Qiu, Weichao Mao, Archit Patke et al.

Large language models (LLMs) have been driving a new wave of interactive AI applications across numerous domains. However, efficiently serving LLM inference requests is challenging due to their unpredictable execution times originating from the autoregressive nature of generative models. Existing LLM serving systems exploit first-come-first-serve (FCFS) scheduling, suffering from head-of-line blocking issues. To address the non-deterministic nature of LLMs and enable efficient interactive LLM serving, we present a speculative shortest-job-first (SSJF) scheduler that uses a light proxy model to predict LLM output sequence lengths. Our open-source SSJF implementation does not require changes to memory management or batching strategies. Evaluations on real-world datasets and production workload traces show that SSJF reduces average job completion times by 30.5-39.6% and increases throughput by 2.2-3.6x compared to FCFS schedulers, across no batching, dynamic batching, and continuous batching settings.

DCDec 26, 2025
Agentic Structured Graph Traversal for Root Cause Analysis of Code-related Incidents in Cloud Applications

Shengkun Cui, Rahul Krishna, Saurabh Jha et al.

Cloud incidents pose major operational challenges in production, with unresolved production cloud incidents cost on average over $2M per hour. Prior research identifies code- and configuration-related issues as the predominant category of root causes in cloud incidents. This paper introduces PRAXIS, an orchestrator that manages and deploys an agentic workflow for diagnosing code- and configuration-caused cloud incidents. PRAXIS employs an LLM-driven structured traversal over two types of graph: (1) a service dependency graph (SDG) that captures microservice-level dependencies; and (2) a hammock-block program dependence graph (PDG) that captures code-level dependencies for each microservice. Together, these graphs encode microservice- and code-level dependencies and the LLM acts as a traversal policy over these graphs, moving between services and code dependencies to localize and explain failures. Compared to state-of-the-art ReAct baselines, PRAXIS improves RCA accuracy by up to 3.1x while reducing token consumption by 3.8x. PRAXIS is demonstrated on a set of 30 comprehensive real-world incidents that is being compiled into an RCA benchmark.

SEJul 1, 2019Code
Kayotee: A Fault Injection-based System to Assess the Safety and Reliability of Autonomous Vehicles to Faults and Errors

Saurabh Jha, Timothy Tsai, Siva Hari et al.

Fully autonomous vehicles (AVs), i.e., AVs with autonomy level 5, are expected to dominate road transportation in the near-future and contribute trillions of dollars to the global economy. The general public, government organizations, and manufacturers all have significant concern regarding resiliency and safety standards of the autonomous driving system (ADS) of AVs . In this work, we proposed and developed (a) `Kayotee' - a fault injection-based tool to systematically inject faults into software and hardware components of the ADS to assess the safety and reliability of AVs to faults and errors, and (b) an ontology model to characterize errors and safety violations impacting reliability and safety of AVs. Kayotee is capable of characterizing fault propagation and resiliency at different levels - (a) hardware, (b) software, (c) vehicle dynamics, and (d) traffic resilience. We used Kayotee to study a proprietary ADS technology built by Nvidia corporation and are currently applying Kayotee to other open-source ADS systems.

ROApr 27, 2015Code
Systems-theoretic Safety Assessment of Robotic Telesurgical Systems

Homa Alemzadeh, Daniel Chen, Andrew Lewis et al.

Robotic telesurgical systems are one of the most complex medical cyber-physical systems on the market, and have been used in over 1.75 million procedures during the last decade. Despite significant improvements in design of robotic surgical systems through the years, there have been ongoing occurrences of safety incidents during procedures that negatively impact patients. This paper presents an approach for systems-theoretic safety assessment of robotic telesurgical systems using software-implemented fault-injection. We used a systemstheoretic hazard analysis technique (STPA) to identify the potential safety hazard scenarios and their contributing causes in RAVEN II robot, an open-source robotic surgical platform. We integrated the robot control software with a softwareimplemented fault-injection engine which measures the resilience of the system to the identified safety hazard scenarios by automatically inserting faults into different parts of the robot control software. Representative hazard scenarios from real robotic surgery incidents reported to the U.S. Food and Drug Administration (FDA) MAUDE database were used to demonstrate the feasibility of the proposed approach for safety-based design of robotic telesurgical systems.

DCMar 14, 2025
Characterizing GPU Resilience and Impact on AI/HPC Systems

Shengkun Cui, Archit Patke, Hung Nguyen et al.

This study characterizes GPU resilience in Delta HPC, a large-scale AI system that consists of 1,056 A100 and H100 GPUs, with over 1,300 petaflops of peak throughput. Delta HPC is operated by the National Center for Supercomputing Applications (NCSA) at the University of Illinois Urbana-Champaign. We used 2.5 years of operational data (11.7 million GPU hours) on GPU errors. Our major findings include: (i) H100 GPU memory resilience is worse than A100 GPU memory, with 3.2x lower per-GPU MTBE for memory errors, (ii) The GPU memory error-recovery mechanisms on H100 GPUs are insufficient to handle the increased memory capacity, (iii) H100 GPUs demonstrate significantly improved GPU hardware resilience over A100 GPUs with respect to critical hardware components, (iv) GPU errors on both A100 and H100 GPUs frequently result in job failures due to the lack of robust recovery mechanisms at the application level, and (v) We project the impact of GPU node availability on larger-scales and find that significant overprovisioning of 5% is necessary to handle GPU failures.

AIOct 19, 2021
Watch out for the risky actors: Assessing risk in dynamic environments for safe driving

Saurabh Jha, Yan Miao, Zbigniew Kalbarczyk et al.

Driving in a dynamic environment that consists of other actors is inherently a risky task as each actor influences the driving decision and may significantly limit the number of choices in terms of navigation and safety plan. The risk encountered by the Ego actor depends on the driving scenario and the uncertainty associated with predicting the future trajectories of the other actors in the driving scenario. However, not all objects pose a similar risk. Depending on the object's type, trajectory, position, and the associated uncertainty with these quantities; some objects pose a much higher risk than others. The higher the risk associated with an actor, the more attention must be directed towards that actor in terms of resources and safety planning. In this paper, we propose a novel risk metric to calculate the importance of each actor in the world and demonstrate its usefulness through a case study.

LGJun 30, 2021
Reinforcement Learning based Disease Progression Model for Alzheimer's Disease

Krishnakant V. Saboo, Anirudh Choudhary, Yurui Cao et al.

We model Alzheimer's disease (AD) progression by combining differential equations (DEs) and reinforcement learning (RL) with domain knowledge. DEs provide relationships between some, but not all, factors relevant to AD. We assume that the missing relationships must satisfy general criteria about the working of the brain, for e.g., maximizing cognition while minimizing the cost of supporting cognition. This allows us to extract the missing relationships by using RL to optimize an objective (reward) function that captures the above criteria. We use our model consisting of DEs (as a simulator) and the trained RL agent to predict individualized 10-year AD progression using baseline (year 0) features on synthetic and real data. The model was comparable or better at predicting 10-year cognition trajectories than state-of-the-art learning-based models. Our interpretable model demonstrated, and provided insights into, "recovery/compensatory" processes that mitigate the effect of AD, even though those processes were not explicitly encoded in the model. Our framework combines DEs with RL for modelling AD progression and has broad applicability for understanding other neurological disorders.

DCFeb 22, 2021
BayesPerf: Minimizing Performance Monitoring Errors Using Bayesian Statistics

Subho S. Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk et al.

Hardware performance counters (HPCs) that measure low-level architectural and microarchitectural events provide dynamic contextual information about the state of the system. However, HPC measurements are error-prone due to non determinism (e.g., undercounting due to event multiplexing, or OS interrupt-handling behaviors). In this paper, we present BayesPerf, a system for quantifying uncertainty in HPC measurements by using a domain-driven Bayesian model that captures microarchitectural relationships between HPCs to jointly infer their values as probability distributions. We provide the design and implementation of an accelerator that allows for low-latency and low-power inference of the BayesPerf model for x86 and ppc64 CPUs. BayesPerf reduces the average error in HPC measurements from 40.1% to 7.6% when events are being multiplexed. The value of BayesPerf in real-time decision-making is illustrated with a simple example of scheduling of PCIe transfers.

DCSep 4, 2019
Inductive-bias-driven Reinforcement Learning For Efficient Schedules in Heterogeneous Clusters

Subho S Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk et al.

The problem of scheduling of workloads onto heterogeneous processors (e.g., CPUs, GPUs, FPGAs) is of fundamental importance in modern data centers. Current system schedulers rely on application/system-specific heuristics that have to be built on a case-by-case basis. Recent work has demonstrated ML techniques for automating the heuristic search by using black-box approaches which require significant training data and time, which make them challenging to use in practice. This paper presents Symphony, a scheduling framework that addresses the challenge in two ways: (i) a domain-driven Bayesian reinforcement learning (RL) model for scheduling, which inherently models the resource dependencies identified from the system architecture; and (ii) a sampling-based technique to compute the gradients of a Bayesian model without performing full probabilistic inference. Together, these techniques reduce both the amount of training data and the time required to produce scheduling policies that significantly outperform black-box approaches by up to 2.2x.

DCJul 24, 2019
Live Forensics for Distributed Storage Systems

Saurabh Jha, Shengkun Cui, Tianyin Xu et al.

We present Kaleidoscope an innovative system that supports live forensics for application performance problems caused by either individual component failures or resource contention issues in large-scale distributed storage systems. The design of Kaleidoscope is driven by our study of I/O failures observed in a peta-scale storage system anonymized as PetaStore. Kaleidoscope is built on three key features: 1) using temporal and spatial differential observability for end-to-end performance monitoring of I/O requests, 2) modeling the health of storage components as a stochastic process using domain-guided functions that accounts for path redundancy and uncertainty in measurements, and, 3) observing differences in reliability and performance metrics between similar types of healthy and unhealthy components to attribute the most likely root causes. We deployed Kaleidoscope on PetaStore and our evaluation shows that Kaleidoscope can run live forensics at 5-minute intervals and pinpoint the root causes of 95.8% of real-world performance issues, with negligible monitoring overhead.

LGJul 1, 2019
ML-based Fault Injection for Autonomous Vehicles: A Case for Bayesian Fault Injection

Saurabh Jha, Subho S. Banerjee, Timothy Tsai et al.

The safety and resilience of fully autonomous vehicles (AVs) are of significant concern, as exemplified by several headline-making accidents. While AV development today involves verification, validation, and testing, end-to-end assessment of AV systems under accidental faults in realistic driving scenarios has been largely unexplored. This paper presents DriveFI, a machine learning-based fault injection engine, which can mine situations and faults that maximally impact AV safety, as demonstrated on two industry-grade AV technology stacks (from NVIDIA and Baidu). For example, DriveFI found 561 safety-critical faults in less than 4 hours. In comparison, random injection experiments executed over several weeks could not find any safety-critical faults

ROJul 13, 2015
Adverse Events in Robotic Surgery: A Retrospective Study of 14 Years of FDA Data

Homa Alemzadeh, Ravishankar K. Iyer, Zbigniew Kalbarczyk et al.

Understanding the causes and patient impacts of surgical adverse events will help improve systems and operational practices to avoid incidents in the future. We analyzed the adverse events data related to robotic systems and instruments used in minimally invasive surgery, reported to the U.S. FDA MAUDE database from January 2000 to December 2013. We determined the number of events reported per procedure and per surgical specialty, the most common types of device malfunctions and their impact on patients, and the causes for catastrophic events such as major complications, patient injuries, and deaths. During the study period, 144 deaths (1.4% of the 10,624 reports), 1,391 patient injuries (13.1%), and 8,061 device malfunctions (75.9%) were reported. The numbers of injury and death events per procedure have stayed relatively constant since 2007 (mean = 83.4, 95% CI, 74.2-92.7). Surgical specialties, for which robots are extensively used, such as gynecology and urology, had lower number of injuries, deaths, and conversions per procedure than more complex surgeries, such as cardiothoracic and head and neck (106.3 vs. 232.9, Risk Ratio = 2.2, 95% CI, 1.9-2.6). Device and instrument malfunctions, such as falling of burnt/broken pieces of instruments into the patient (14.7%), electrical arcing of instruments (10.5%), unintended operation of instruments (8.6%), system errors (5%), and video/imaging problems (2.6%), constituted a major part of the reports. Device malfunctions impacted patients in terms of injuries or procedure interruptions. In 1,104 (10.4%) of the events, the procedure was interrupted to restart the system (3.1%), to convert the procedure to non-robotic techniques (7.3%), or to reschedule it to a later time (2.5%). Adoption of advanced techniques in design and operation of robotic surgical systems may reduce these preventable incidents in the future.