Keshav Pingali

LG
h-index56
12papers
19citations
Novelty54%
AI Score53

12 Papers

DCMay 22Code
VLCs: Managing Parallelism with Virtualized Libraries

Yineng Yan, William Ruys, Hochan Lee et al.

As the complexity and scale of modern parallel machines continue to grow, programmers increasingly rely on composition of software libraries to encapsulate and exploit parallelism. However, many libraries are not designed with composition in mind and assume they have exclusive access to all resources. Using such libraries concurrently can result in contention and degraded performance. Prior solutions involve modifying the libraries or the OS, which is often infeasible. We propose Virtual Library Contexts (VLCs), which are process subunits that encapsulate sets of libraries and associated resource allocations. VLCs control the resource utilization of these libraries without modifying library code. This enables the user to partition resources between libraries to prevent contention, or load multiple copies of the same library to allow parallel execution of otherwise thread-unsafe code within the same process. In this paper, we describe and evaluate C++ and Python prototypes of VLCs. Experiments show VLCs enable a speedup up to 2.85x on benchmarks including applications using OpenMP, OpenBLAS, and LibTorch. Source code of VLCs is available at https://github.com/pecos/Virtual-Library-Context.

CESep 24, 2023Code
Physics Informed Neural Network Code for 2D Transient Problems (PINN-2DT) Compatible with Google Colab

Paweł Maczuga, Maciej Sikora, Maciej Skoczeń et al.

We present an open-source Physics Informed Neural Network environment for simulations of transient phenomena on two-dimensional rectangular domains, with the following features: (1) it is compatible with Google Colab which allows automatic execution on cloud environment; (2) it supports two dimensional time-dependent PDEs; (3) it provides simple interface for definition of the residual loss, boundary condition and initial loss, together with their weights; (4) it support Neumann and Dirichlet boundary conditions; (5) it allows for customizing the number of layers and neurons per layer, as well as for arbitrary activation function; (6) the learning rate and number of epochs are available as parameters; (7) it automatically differentiates PINN with respect to spatial and temporal variables; (8) it provides routines for plotting the convergence (with running average), initial conditions learnt, 2D and 3D snapshots from the simulation and movies (9) it includes a library of problems: (a) non-stationary heat transfer; (b) wave equation modeling a tsunami; (c) atmospheric simulations including thermal inversion; (d) tumor growth simulations.

LGNov 3, 2025
Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants

Bozhi You, Irene Wang, Zelal Su Mustafaoglu et al.

Attention is a fundamental building block of large language models (LLMs), so there have been many efforts to implement it efficiently. For example, FlashAttention leverages tiling and kernel fusion to optimize attention. Recently, a number of variants of attention have been introduced to enhance model quality or efficiency. Supporting them efficiently remains difficult since they usually require specialized kernels or hand-tuned implementations. FlexAttention recently addressed part of this gap by using static programming templates to support FlashAttention-like kernels for a subset of attention variants. In this paper, we introduce Flashlight, a compiler-native framework within the PyTorch ecosystem that automatically generates fused, FlashAttention-style kernels for arbitrary attention-based programs, without relying on static templates or predefined kernel specializations. Flashlight leverages PyTorch's compilation workflow to fuse and tile attention computations transparently, enabling efficient execution for diverse attention patterns. Not only does it support all variants expressible in the FlexAttention model but it also handles more general, data-dependent attention formulations that are beyond the capabilities of FlexAttention. Our results show that Flashlight produces kernels with competitive or superior performance to FlexAttention, while offering the flexibility of native PyTorch code, enabling developers to rapidly explore new attention models without sacrificing performance.

LGMay 3Code
Towards Efficient and Expressive Offline RL via Flow-Anchored Noise-conditioned Q-Learning

Sungyoung Lee, Dohyeong Kim, Eshan Balachandar et al.

We propose Flow-Anchored Noise-conditioned Q-Learning (FAN), a highly efficient and high-performing offline reinforcement learning (RL) algorithm. Recent work has shown that expressive flow policies and distributional critics improve offline RL performance, but at a high computational cost. Specifically, flow policies require iterative sampling to produce a single action, and distributional critics require computation over multiple samples (e.g., quantiles) to estimate value. To address these inefficiencies while maintaining high performance, we introduce FAN. Our method employs a behavior regularization technique that utilizes only a single flow policy iteration and requires only a single Gaussian noise sample for distributional critics. Our theoretical analysis of convergence and performance bounds demonstrates that these simplifications not only improve efficiency but also lead to superior task performance. Experiments on robotic manipulation and locomotion tasks demonstrate that FAN achieves state-of-the-art performance while significantly reducing both training and inference runtimes. We release our code at https://github.com/brianlsy98/FAN.

RONov 5, 2018Code
SLAMBooster: An Application-aware Controller for Approximation in SLAM

Yan Pei, Swarnendu Biswas, Donald S. Fussell et al.

Simultaneous Localization and Mapping (SLAM) is the problem of constructing a map of a mobile agent's environment while localizing the agent within the map. Dense SLAM algorithms perform reconstruction and localization at pixel granularity. These algorithms require a lot of computational power, which has hindered their use on low-power resource-constrained devices. Approximate computing can be used to speed up SLAM implementations as long as the approximations do not prevent the agent from navigating correctly through the environment. Previous studies of approximation in SLAM have assumed that the entire trajectory of the agent is known before the agent starts, and they have focused on offline controllers that set approximation knobs at the start of the trajectory. In practice, the trajectory is not known ahead of time, and allowing knob settings to change dynamically opens up more opportunities for reducing computation time and energy. We describe SLAMBooster, an application-aware, online control system for dense SLAM that adaptively controls approximation knobs during the motion of the agent. SLAMBooster is based on a control technique called proportional-integral-derivative (PID) controller but our experiments showed this application-agnostic controller led to an unacceptable reduction in localization accuracy. To address this problem, SLAMBooster also exploits domain knowledge for controlling approximation by performing smooth surface detection and pose correction. We implemented SLAMBooster in the open-source SLAMBench framework and evaluated it on several trajectories. Our experiments show that on the average, SLAMBooster reduces the computation time by 72% and energy consumption by 35% on an embedded platform, while maintaining the accuracy of localization within reasonable bounds. These improvements make it feasible to deploy SLAM on a wider range of devices.

HCMar 26, 2025
TAMA: A Human-AI Collaborative Thematic Analysis Framework Using Multi-Agent LLMs for Clinical Interviews

Huimin Xu, Seungjun Yi, Terence Lim et al.

Thematic analysis (TA) is a widely used qualitative approach for uncovering latent meanings in unstructured text data. TA provides valuable insights in healthcare but is resource-intensive. Large Language Models (LLMs) have been introduced to perform TA, yet their applications in healthcare remain unexplored. Here, we propose TAMA: A Human-AI Collaborative Thematic Analysis framework using Multi-Agent LLMs for clinical interviews. We leverage the scalability and coherence of multi-agent systems through structured conversations between agents and coordinate the expertise of cardiac experts in TA. Using interview transcripts from parents of children with Anomalous Aortic Origin of a Coronary Artery (AAOCA), a rare congenital heart disease, we demonstrate that TAMA outperforms existing LLM-assisted TA approaches, achieving higher thematic hit rate, coverage, and distinctiveness. TAMA demonstrates strong potential for automated TA in clinical settings by leveraging multi-agent LLM systems with human-in-the-loop integration by enhancing quality while significantly reducing manual workload.

LGJan 13, 2025
HyperQuery: Beyond Binary Link Prediction

Sepideh Maleki, Josh Vekhter, Keshav Pingali

Groups with complex set intersection relations are a natural way to model a wide array of data, from the formation of social groups to the complex protein interactions which form the basis of biological life. One approach to representing such higher order relationships is as a hypergraph. However, efforts to apply machine learning techniques to hypergraph structured datasets have been limited thus far. In this paper, we address the problem of link prediction in knowledge hypergraphs as well as simple hypergraphs and develop a novel, simple, and effective optimization architecture that addresses both tasks. Additionally, we introduce a novel feature extraction technique using node level clustering and we show how integrating data from node-level labels can improve system performance. Our self-supervised approach achieves significant improvement over state of the art baselines on several hyperedge prediction and knowledge hypergraph completion benchmarks.

LGMar 13
Optimize Wider, Not Deeper: Consensus Aggregation for Policy Optimization

Zelal Su, Mustafaoglu, Sungyoung Lee et al.

Proximal policy optimization (PPO) approximates the trust region update using multiple epochs of clipped SGD. Each epoch may drift further from the natural gradient direction, creating path-dependent noise. To understand this drift, we can use Fisher information geometry to decompose policy updates into signal (the natural gradient projection) and waste (the Fisher-orthogonal residual that consumes trust region budget without first-order surrogate improvement). Empirically, signal saturates but waste grows with additional epochs, creating an optimization-depth dilemma. We propose Consensus Aggregation for Policy Optimization (CAPO), which redirects compute from depth to width: $K$ PPO replicates are optimized on the same batch, differing only in minibatch shuffling order, and then aggregated into a consensus. We study aggregation in two spaces: Euclidean parameter space, and the natural parameter space of the policy distribution via the logarithmic opinion pool. In natural parameter space, the consensus provably achieves higher KL-penalized surrogate and tighter trust region compliance than the mean expert; parameter averaging inherits these guarantees approximately. On continuous control tasks, CAPO outperforms PPO and compute-matched deeper baselines under fixed sample budgets by up to 8.6x. CAPO demonstrates that policy optimization can be improved by optimizing wider, rather than deeper, without additional environment interactions.

LGApr 17, 2025
Evolutionary Policy Optimization

Zelal Su "Lain" Mustafaoglu, Keshav Pingali, Risto Miikkulainen

A key challenge in reinforcement learning (RL) is managing the exploration-exploitation trade-off without sacrificing sample efficiency. Policy gradient (PG) methods excel in exploitation through fine-grained, gradient-based optimization but often struggle with exploration due to their focus on local search. In contrast, evolutionary computation (EC) methods excel in global exploration, but lack mechanisms for exploitation. To address these limitations, this paper proposes Evolutionary Policy Optimization (EPO), a hybrid algorithm that integrates neuroevolution with policy gradient methods for policy optimization. EPO leverages the exploration capabilities of EC and the exploitation strengths of PG, offering an efficient solution to the exploration-exploitation dilemma in RL. EPO is evaluated on the Atari Pong and Breakout benchmarks. Experimental results show that EPO improves both policy quality and sample efficiency compared to standard PG and EC methods, making it effective for tasks that require both exploration and local optimization.

DCAug 15, 2021
Sonic: A Sampling-based Online Controller for Streaming Applications

Yan Pei, Keshav Pingali

Many applications in important problem domains such as machine learning and computer vision are streaming applications that take a sequence of inputs over time. It is challenging to find knob settings that optimize the run-time performance of such applications because the optimal knob settings are usually functions of inputs, computing platforms, time as well as user's requirements, which can be very diverse. Most prior works address this problem by offline profiling followed by training models for control. However, profiling-based approaches incur large overhead before execution; it is also difficult to redeploy them in other run-time configurations. In this paper, we propose Sonic, a sampling-based online controller for long-running streaming applications that does not require profiling ahead of time. Within each phase of a streaming application's execution, Sonic utilizes the beginning portion to sample the knob space strategically and aims to pick the optimal knob setting for the rest of the phase, given a user-specified constrained optimization problem. A hybrid approach of machine learning regressions and Bayesian optimization are used for better overall sampling choices. Sonic is implemented independent of application, device, input, performance objective and constraints. We evaluate Sonic on traditional parallel benchmarks as well as on deep learning inference benchmarks across multiple platforms. Our experiments show that when using Sonic to control knob settings, application run-time performance is only 5.3% less than if optimal knob settings were used, demonstrating that Sonic is able to find near-optimal knob settings under diverse run-time configurations without prior knowledge quickly.

AIJun 16, 2021
Optimizing Graph Transformer Networks with Graph-based Techniques

Loc Hoang, Udit Agarwal, Gurbinder Gill et al.

Graph transformer networks (GTN) are a variant of graph convolutional networks (GCN) that are targeted to heterogeneous graphs in which nodes and edges have associated type information that can be exploited to improve inference accuracy. GTNs learn important metapaths in the graph, create weighted edges for these metapaths, and use the resulting graph in a GCN. Currently, the only available implementation of GTNs uses dense matrix multiplication to find metapaths. Unfortunately, the space overhead of this approach can be large, so in practice it is used only for small graphs. In addition, the matrix-based implementation is not fine-grained enough to use random-walk based methods to optimize metapath finding. In this paper, we present a graph-based formulation and implementation of the GTN metapath finding problem. This graph-based formulation has two advantages over the matrix-based approach. First, it is more space efficient than the original GTN implementation and more compute-efficient for metapath sizes of practical interest. Second, it permits us to implement a sampling method that reduces the number of metapaths that must be enumerated, allowing the implementation to be used for larger graphs and larger metapath sizes. Experimental results show that our implementation is $6.5\times$ faster than the original GTN implementation on average for a metapath length of 4, and our sampling implementation is $155\times$ faster on average than this implementation without compromising on the accuracy of the GTN.

SIMar 9, 2021
Scalable Hypergraph Embedding System

Sepideh Maleki, Donya Saless, Dennis P. Wall et al.

Many problems such as node classification and link prediction in network data can be solved using graph embeddings. However, it is difficult to use graphs to capture non-binary relations such as communities of nodes. These kinds of complex relations are expressed more naturally as hypergraphs. While hypergraphs are a generalization of graphs, state-of-the-art graph embedding techniques are not adequate for solving prediction and classification tasks on large hypergraphs accurately in reasonable time. In this paper, we introduce HyperNetVec, a novel hierarchical framework for scalable unsupervised hypergraph embedding. HyperNetVec exploits shared-memory parallelism and is capable of generating high quality embeddings for real-world hypergraphs with millions of nodes and hyperedges in only a couple of minutes while existing hypergraph systems either fail for such large hypergraphs or may take days to produce the embeddings.