LGMar 13, 2023
X-Former: In-Memory Acceleration of TransformersShrihari Sridharan, Jacob R. Stevens, Kaushik Roy et al.
Transformers have achieved great success in a wide variety of natural language processing (NLP) tasks due to the attention mechanism, which assigns an importance score for every word relative to other words in a sequence. However, these models are very large, often reaching hundreds of billions of parameters, and therefore require a large number of DRAM accesses. Hence, traditional deep neural network (DNN) accelerators such as GPUs and TPUs face limitations in processing Transformers efficiently. In-memory accelerators based on non-volatile memory promise to be an effective solution to this challenge, since they provide high storage density while performing massively parallel matrix vector multiplications within memory arrays. However, attention score computations, which are frequently used in Transformers (unlike CNNs and RNNs), require matrix vector multiplications (MVM) where both operands change dynamically for each input. As a result, conventional NVM-based accelerators incur high write latency and write energy when used for Transformers, and further suffer from the low endurance of most NVM technologies. To address these challenges, we present X-Former, a hybrid in-memory hardware accelerator that consists of both NVM and CMOS processing elements to execute transformer workloads efficiently. To improve the hardware utilization of X-Former, we also propose a sequence blocking dataflow, which overlaps the computations of the two processing elements and reduces execution time. Across several benchmarks, we show that X-Former achieves upto 85x and 7.5x improvements in latency and energy over a NVIDIA GeForce GTX 1060 GPU and upto 10.7x and 4.6x improvements in latency and energy over a state-of-the-art in-memory NVM accelerator.
LGDec 7, 2025
KV-CAR: KV Cache Compression using Autoencoders and KV Reuse in Large Language ModelsSourjya Roy, Shrihari Sridharan, Surya Selvam et al.
As Large Language Models (LLMs) scale in size and context length, the memory requirements of the key value (KV) cache have emerged as a major bottleneck during autoregressive decoding. The KV cache grows with sequence length and embedding dimension, often exceeding the memory footprint of the model itself and limiting achievable batch sizes and context windows. To address this challenge, we present KV CAR, a unified and architecture agnostic framework that significantly reduces KV cache storage while maintaining model fidelity. KV CAR combines two complementary techniques. First, a lightweight autoencoder learns compact representations of key and value tensors along the embedding dimension, compressing them before they are stored in the KV cache and restoring them upon retrieval. Second, a similarity driven reuse mechanism identifies opportunities to reuse KV tensors of specific attention heads across adjacent layers. Together, these methods reduce the dimensional and structural redundancy in KV tensors without requiring changes to the transformer architecture. Evaluations on GPT 2 and TinyLLaMA models across Wikitext, C4, PIQA, and Winogrande datasets demonstrate that KV CAR achieves up to 47.85 percent KV cache memory reduction with minimal impact on perplexity and zero shot accuracy. System level measurements on an NVIDIA A40 GPU show that the reduced KV footprint directly translates into longer sequence lengths and larger batch sizes during inference. These results highlight the effectiveness of KV CAR in enabling memory efficient LLM inference.
LGDec 7, 2025
GradientSpace: Unsupervised Data Clustering for Improved Instruction TuningShrihari Sridharan, Deepak Ravikumar, Anand Raghunathan et al.
Instruction tuning is one of the key steps required for adapting large language models (LLMs) to a broad spectrum of downstream applications. However, this procedure is difficult because real-world datasets are rarely homogeneous; they consist of a mixture of diverse information, causing gradient interference, where conflicting gradients pull the model in opposing directions, degrading performance. A common strategy to mitigate this issue is to group data based on semantic or embedding similarity. However, this fails to capture how data influences model parameters during learning. While recent works have attempted to cluster gradients directly, they randomly project gradients into lower dimensions to manage memory, which leads to accuracy loss. Moreover, these methods rely on expert ensembles which necessitates multiple inference passes and expensive on-the-fly gradient computations during inference. To address these limitations, we propose GradientSpace, a framework that clusters samples directly in full-dimensional gradient space. We introduce an online SVD-based algorithm that operates on LoRA gradients to identify latent skills without the infeasible cost of storing all sample gradients. Each cluster is used to train a specialized LoRA expert along with a lightweight router trained to select the best expert during inference. We show that routing to a single, appropriate expert outperforms expert ensembles used in prior work, while significantly reducing inference latency. Our experiments across mathematical reasoning, code generation, finance, and creative writing tasks demonstrate that GradientSpace leads to coherent expert specialization and consistent accuracy gains over state-of-the-art clustering methods and finetuning techniques.
LGMar 23, 2024
Ev-Edge: Efficient Execution of Event-based Vision Algorithms on Commodity Edge PlatformsShrihari Sridharan, Surya Selvam, Kaushik Roy et al.
Event cameras have emerged as a promising sensing modality for autonomous navigation systems, owing to their high temporal resolution, high dynamic range and negligible motion blur. To process the asynchronous temporal event streams from such sensors, recent research has shown that a mix of Artificial Neural Networks (ANNs), Spiking Neural Networks (SNNs) as well as hybrid SNN-ANN algorithms are necessary to achieve high accuracies across a range of perception tasks. However, we observe that executing such workloads on commodity edge platforms which feature heterogeneous processing elements such as CPUs, GPUs and neural accelerators results in inferior performance. This is due to the mismatch between the irregular nature of event streams and diverse characteristics of algorithms on the one hand and the underlying hardware platform on the other. We propose Ev-Edge, a framework that contains three key optimizations to boost the performance of event-based vision systems on edge platforms: (1) An Event2Sparse Frame converter directly transforms raw event streams into sparse frames, enabling the use of sparse libraries with minimal encoding overheads (2) A Dynamic Sparse Frame Aggregator merges sparse frames at runtime by trading off the temporal granularity of events and computational demand thereby improving hardware utilization (3) A Network Mapper maps concurrently executing tasks to different processing elements while also selecting layer precision by considering both compute and communication overheads. On several state-of-art networks for a range of autonomous navigation tasks, Ev-Edge achieves 1.28x-2.05x improvements in latency and 1.23x-2.15x in energy over an all-GPU implementation on the NVIDIA Jetson Xavier AGX platform for single-task execution scenarios. Ev-Edge also achieves 1.43x-1.81x latency improvements over round-robin scheduling methods in multi-task execution scenarios.
LGNov 28, 2025
Experts are all you need: A Composable Framework for Large Language Model InferenceShrihari Sridharan, Sourjya Roy, Anand Raghunathan et al.
Large Language Models (LLMs) have achieved state-of-the-art accuracies in a variety of natural language processing (NLP) tasks. However, this success comes at the cost of increased model sizes which leads to additional computational burden. Mixture of Experts (MoEs) overcome this bottleneck by decoupling model capacity from computation by only activating a subset of parameters or "experts". However, these models require joint pretraining of these experts along with the router and do not model multi-step reasoning. In contrast, multi-agent frameworks improve reasoning by decomposing complex problems into modular subtasks. However, these frameworks rely on sequential "plan--act--observe" loops, which introduce significant latency. Our work, Comp-LLM, addresses these challenges by introducing a composable inference framework that enables cross-expert collaboration via an explicit sub-query dependency graph. Comp-LLM consists of three components: (1) A Sub-query Generator that decomposes an input query, assigns each sub-query to an appropriate expert using embedding similarity, and constructs a dependency graph; (2) A Query Executor that processes nodes in the graph and identifies opportunities for parallelism based on dependencies and resource constraints; and (3) A Response Aggregator that synthesizes intermediate expert responses into a coherent final answer. Across several benchmarks, Comp-LLM achieves up to 11.01% accuracy improvement over monolithic LLMs of similar size, while offering 1.67x--3.56x reduction in model size with no significant degradation relative to the largest model in its family. Additionally, Comp-LLM provides 1.1x--1.7x latency improvement compared to sequential sub-query processing.
LGFeb 25, 2020
TxSim:Modeling Training of Deep Neural Networks on Resistive Crossbar SystemsSourjya Roy, Shrihari Sridharan, Shubham Jain et al.
Resistive crossbars have attracted significant interest in the design of Deep Neural Network (DNN) accelerators due to their ability to natively execute massively parallel vector-matrix multiplications within dense memory arrays. However, crossbar-based computations face a major challenge due to a variety of device and circuit-level non-idealities, which manifest as errors in the vector-matrix multiplications and eventually degrade DNN accuracy. To address this challenge, there is a need for tools that can model the functional impact of non-idealities on DNN training and inference. Existing efforts towards this goal are either limited to inference, or are too slow to be used for large-scale DNN training. We propose TxSim, a fast and customizable modeling framework to functionally evaluate DNN training on crossbar-based hardware considering the impact of non-idealities. The key features of TxSim that differentiate it from prior efforts are: (i) It comprehensively models non-idealities during all training operations (forward propagation, backward propagation, and weight update) and (ii) it achieves computational efficiency by mapping crossbar evaluations to well-optimized BLAS routines and incorporates speedup techniques to further reduce simulation time with minimal impact on accuracy. TxSim achieves orders-of-magnitude improvement in simulation speed over prior works, and thereby makes it feasible to evaluate training of large-scale DNNs on crossbars. Our experiments using TxSim reveal that the accuracy degradation in DNN training due to non-idealities can be substantial (3%-10%) for large-scale DNNs, underscoring the need for further research in mitigation techniques. We also analyze the impact of various device and circuit-level parameters and the associated non-idealities to provide key insights that can guide the design of crossbar-based DNN training accelerators.