Santosh Pandey

h-index: 13

Papers: 4

Citations: 239

Novelty: 61%

AI Score: 35

Ranked #108,799 of 194,257 authors (top 56%); #378 in AR (top 59%)

4 Papers

5.1 | AR | Mar 29, 2025 | Code

Concorde: Fast and Accurate CPU Performance Modeling with Compositional Analytical-ML Fusion

Arash Nasr-Esfahany, Mohammad Alizadeh, Victor Lee et al.

Cycle-level simulators such as gem5 are widely used in microarchitecture design, but they are prohibitively slow for large-scale design space explorations. We present Concorde, a new methodology for learning fast and accurate performance models of microarchitectures. Unlike existing simulators and learning approaches that emulate each instruction, Concorde predicts the behavior of a program based on compact performance distributions that capture the impact of different microarchitectural components. It derives these performance distributions using simple analytical models that estimate bounds on performance induced by each microarchitectural component, providing a simple yet rich representation of a program's performance characteristics across a large space of microarchitectural parameters. Experiments show that Concorde is more than five orders of magnitude faster than a reference cycle-level simulator, with about 2% average Cycles-Per-Instruction (CPI) prediction error across a range of SPEC, open-source, and proprietary benchmarks. This enables rapid design-space exploration and performance sensitivity analyses that are currently infeasible, e.g., in about an hour, we conducted a first-of-its-kind fine-grained performance attribution to different microarchitectural components across a diverse set of programs, requiring nearly 150 million CPI evaluations.

4.3 | AR | Apr 16, 2024

Tao: Re-Thinking DL-based Microarchitecture Simulation

Santosh Pandey, Amir Yazdanbakhsh, Hang Liu

Microarchitecture simulators are indispensable tools for microarchitecture designers to validate, estimate, and optimize new hardware that meets specific design requirements. While the quest for fast, accurate, and detailed microarchitecture simulation has been ongoing for decades, existing simulators excel and fall short in different aspects: (i) Although execution-driven simulation is accurate and detailed, it is extremely slow and requires expert-level experience to design. (ii) Trace-driven simulation reuses execution traces in pursuit of fast simulation but faces accuracy concerns and fails to achieve significant speedup. (iii) Emerging deep learning (DL)-based simulations are remarkably fast and have acceptable accuracy but fail to provide the low-level microarchitectural performance metrics crucial for microarchitectural bottleneck analysis. Additionally, they introduce substantial overheads from trace regeneration and model re-training when simulating a new microarchitecture. Re-thinking the advantages and limitations of these simulation paradigms, this paper introduces TAO, which redesigns DL-based simulation with three primary contributions: First, we propose a new training dataset design such that the subsequent simulation only needs functional traces as input, which can be rapidly generated and reused across microarchitectures. Second, we redesign the input features and the DL model using self-attention to support predicting various performance metrics. Third, we propose techniques to train a microarchitecture-agnostic embedding layer that enables fast transfer learning between different microarchitectural configurations and reduces the re-training overhead of conventional DL-based simulators. Our extensive evaluation shows TAO can reduce the overall training and simulation time by 18.06x over state-of-the-art DL-based approaches.

18.6 | DC | Jul 16, 2020

FTRANS: Energy-Efficient Acceleration of Transformers using FPGA

Bingbing Li, Santosh Pandey, Haowen Fang et al.

In natural language processing (NLP), the "Transformer" architecture was proposed as the first transduction model relying entirely on self-attention mechanisms without using sequence-aligned recurrent neural networks (RNNs) or convolution, and it achieved significant improvements on sequence-to-sequence tasks. The intensive computation and storage required by these pre-trained language representations have impeded their adoption on computation- and memory-constrained devices. The field-programmable gate array (FPGA) is widely used to accelerate deep learning algorithms because of its high parallelism and low latency. However, the trained models are still too large to fit on an FPGA fabric. In this paper, we propose an efficient acceleration framework, FTRANS, for transformer-based large-scale language representations. Our framework includes an enhanced block-circulant matrix (BCM)-based weight representation that enables model compression of large-scale language representations at the algorithm level with little accuracy degradation, and an acceleration design at the architecture level. Experimental results show that our proposed framework reduces the model size of NLP models by up to 16 times. Our FPGA design achieves 27.07x and 81x improvements in performance and energy efficiency compared to CPU, and up to 8.80x improvement in energy efficiency compared to GPU.
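To make the block-circulant idea in the abstract above concrete, here is a minimal sketch of a generic BCM weight representation, not the paper's enhanced variant or its FPGA design. It assumes each b x b block is circulant and is stored only as its first column, so a block needs b values instead of b*b. The function names (`circulant_matvec`, `bcm_matvec`, `dense_from_blocks`) are hypothetical and chosen for illustration.

```python
from typing import List

Vector = List[float]


def circulant_matvec(c: Vector, x: Vector) -> Vector:
    """Multiply the circulant matrix whose first column is c by x.

    Entry (i, j) of that matrix is c[(i - j) % b], so only the b values
    in c are stored rather than all b * b entries.
    """
    b = len(c)
    return [sum(c[(i - j) % b] * x[j] for j in range(b)) for i in range(b)]


def bcm_matvec(blocks: List[List[Vector]], x: Vector) -> Vector:
    """Multiply a block-circulant matrix by x.

    blocks[r][s] is the defining vector (first column) of block (r, s);
    x must have length (number of block columns) * b.
    """
    b = len(blocks[0][0])
    y: Vector = []
    for block_row in blocks:
        acc = [0.0] * b
        for s, c in enumerate(block_row):
            part = circulant_matvec(c, x[s * b:(s + 1) * b])
            acc = [a + p for a, p in zip(acc, part)]
        y.extend(acc)
    return y


def dense_from_blocks(blocks: List[List[Vector]]) -> List[Vector]:
    """Expand the compressed form into a dense matrix (for checking only)."""
    b = len(blocks[0][0])
    rows = []
    for block_row in blocks:
        for i in range(b):
            rows.append([c[(i - j) % b] for c in block_row for j in range(b)])
    return rows


# One block row, two block columns, block size b = 2.
blocks = [[[1.0, 2.0], [3.0, 4.0]]]
x = [1.0, 0.0, 0.0, 1.0]
y = bcm_matvec(blocks, x)
dense = dense_from_blocks(blocks)
stored = sum(len(c) for row in blocks for c in row)   # values kept
expanded = sum(len(r) for r in dense)                 # values in dense form
```

In this toy case the compressed form stores 4 values against 8 in the dense matrix, a ratio equal to the block size. The direct sum here costs O(b^2) per block; circulant products are commonly computed with FFTs instead, which this sketch leaves out.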

3.1 | CR | Mar 4, 2016

Centralized group key management scheme for secure multicast communication without re-keying

Vinod Kumar, S. K. Pandey, Rajendra Kumar

In secure group communication, data is transmitted in such a way that only the group members are able to receive the messages. The main problem with solutions based on a symmetric key is the heavy re-keying cost. To reduce the re-keying cost, a tree-based architecture is used, but it requires extra overhead to balance the key tree in order to achieve logarithmic re-keying cost. The main challenge in dynamic and secure multimedia multicast communication is to design a centralized group key management scheme with minimal computational, communication, and storage complexity without compromising security. Several authors have proposed different centralized group key management schemes: one reduces communication complexity but increases computational and storage costs, whereas another decreases the computational and storage costs but breaches forward and backward secrecy. In this paper, we propose a comparatively more efficient centralized group key management scheme that not only minimizes the computational, communication, and storage complexity but also maintains security at the optimal level. The message encryption and decryption costs are also minimized. Further, we provide an extended multicast scheme in which a large number of members can simultaneously issue requests to leave or join the group. To obtain better multicast encryption performance, symmetric-key and asymmetric-key cryptosystems may be combined.
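The re-keying cost that motivates this abstract can be illustrated with a rough counting sketch. It covers only the background problem, not the paper's scheme, and rests on a simplifying assumption that is not taken from the abstract: a full, balanced key tree in which a departing member invalidates every key on its path to the root, and each replacement key is encrypted once per child that still needs it. The function names are hypothetical.

```python
def tree_height(n_members: int, degree: int = 2) -> int:
    """Height of the smallest full degree-ary tree with n_members leaves."""
    height, capacity = 0, 1
    while capacity < n_members:
        capacity *= degree
        height += 1
    return height


def flat_leave_cost(n_members: int) -> int:
    """Flat scheme: the new group key is encrypted once per remaining member."""
    return n_members - 1


def tree_leave_cost(n_members: int, degree: int = 2) -> int:
    """Balanced key tree (assumed model): `height` keys on the leaver's path
    are replaced; each is encrypted for `degree` children, except the lowest
    one, which skips the departed member."""
    return max(degree * tree_height(n_members, degree) - 1, 0)


for n in (8, 1024):
    print(n, flat_leave_cost(n), tree_leave_cost(n))
```

Under this model the tree cost grows logarithmically in the group size while the flat cost grows linearly, which is the gap the abstract refers to. The count holds only while the tree stays balanced, hence the balancing overhead the abstract mentions.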