Sachin S. Sapatnekar

h-index: 60

15 papers

198 citations

Novelty: 44%

AI Score: 45

Ranked #40,393 of 194,257 authors (top 21%) · #119 in AR (top 19%)

15 Papers

4.3 · ET · Sep 20, 2023

3SAT on an All-to-All-Connected CMOS Ising Solver Chip

Hüsrev Cılasun, Ziqing Zeng, Ramprasath S et al.

This work solves 3SAT, a classical NP-complete problem, on a CMOS-based Ising hardware chip with all-to-all connectivity. The paper addresses practical issues in going from algorithms to hardware. It considers several degrees of freedom in mapping the 3SAT problem to the chip - using multiple Ising formulations for 3SAT; exploring multiple strategies for decomposing large problems into subproblems that can be accommodated on the Ising chip; and executing a sequence of these subproblems on CMOS hardware to obtain the solution to the larger problem. These are evaluated within a software framework, and the results are used to identify the most promising formulations and decomposition techniques. These best approaches are then mapped to the all-to-all hardware, and the performance of 3SAT is evaluated on the chip. Experimental data shows that the deployed decomposition and mapping strategies impact SAT solution quality: without our methods, the CMOS hardware cannot achieve 3SAT solutions on SATLIB benchmarks.
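One way to picture what "an Ising formulation for 3SAT" looks like is the textbook independent-set reduction, sketched below in Python. This is an illustrative assumption and not necessarily one of the formulations evaluated in the paper. It creates one binary variable per literal occurrence, rewards each selected literal, and penalizes selecting two literals from the same clause or two contradictory literals. The function names (`sat_to_qubo`, `minimize_by_enumeration`, `solve_sat`) and the penalty value are hypothetical, and exhaustive enumeration stands in for the Ising hardware.

```python
from itertools import product


def sat_to_qubo(clauses, penalty=2):
    """Build a QUBO from CNF clauses (DIMACS-style ints, -v means NOT v).

    Node (c, k) set to 1 means "clause c is satisfied through its k-th literal".
    Energy: E(x) = sum_i linear[i]*x_i + sum_{i<j} quadratic[i, j]*x_i*x_j.
    """
    nodes = [(c, k) for c, clause in enumerate(clauses) for k in range(len(clause))]
    linear = [-1] * len(nodes)  # reward of 1 for every selected literal
    quadratic = {}
    for i, (ci, ki) in enumerate(nodes):
        for j in range(i + 1, len(nodes)):
            cj, kj = nodes[j]
            same_clause = ci == cj
            contradictory = clauses[ci][ki] == -clauses[cj][kj]
            if same_clause or contradictory:
                quadratic[(i, j)] = penalty  # forbid selecting both nodes
    return nodes, linear, quadratic


def minimize_by_enumeration(linear, quadratic):
    """Exhaustive search over all 0/1 vectors (software stand-in for a solver)."""
    best_energy, best_x = None, None
    for x in product((0, 1), repeat=len(linear)):
        energy = sum(h * xi for h, xi in zip(linear, x))
        energy += sum(w * x[i] * x[j] for (i, j), w in quadratic.items())
        if best_energy is None or energy < best_energy:
            best_energy, best_x = energy, x
    return best_energy, best_x


def solve_sat(clauses):
    """Return a (partial) satisfying assignment, or None if unsatisfiable."""
    nodes, linear, quadratic = sat_to_qubo(clauses)
    energy, x = minimize_by_enumeration(linear, quadratic)
    # The minimum reaches -len(clauses) only if one literal per clause
    # can be selected without any conflict, i.e. the formula is satisfiable.
    if energy != -len(clauses):
        return None
    assignment = {}
    for (c, k), picked in zip(nodes, x):
        if picked:
            literal = clauses[c][k]
            assignment[abs(literal)] = literal > 0
    return assignment
```

For example, `solve_sat([(1, 2, 3), (-1, -2, 3), (1, -2, -3)])` returns a dictionary of variable values; variables missing from the dictionary are unconstrained. A binary variable x maps to an Ising spin s through x = (1 + s) / 2. The enumeration is exponential in the number of literal occurrences, so this sketch only suits toy instances.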

3.8 · LG · Aug 23, 2023 · Code

An Open-Source ML-Based Full-Stack Optimization Framework for Machine Learning Accelerators

Hadi Esmaeilzadeh, Soroush Ghodrati, Andrew B. Kahng et al.

Parameterizable machine learning (ML) accelerators are the product of recent breakthroughs in ML. To fully enable their design space exploration (DSE), we propose a physical-design-driven, learning-based prediction framework for hardware-accelerated deep neural network (DNN) and non-DNN ML algorithms. It adopts a unified approach that combines backend power, performance, and area (PPA) analysis with frontend performance simulation, thereby achieving a realistic estimation of both backend PPA and system metrics such as runtime and energy. In addition, our framework includes a fully automated DSE technique, which optimizes backend and system metrics through an automated search of architectural and backend parameters. Experimental studies show that our approach consistently predicts backend PPA and system metrics with an average 7% or less prediction error for the ASIC implementation of two deep learning accelerator platforms, VTA and VeriGOOD-ML, in both a commercial 12 nm process and a research-oriented 45 nm process.

1.2 · AR · Jun 29, 2023

Performance Analysis of DNN Inference/Training with Convolution and non-Convolution Operations

Hadi Esmaeilzadeh, Soroush Ghodrati, Andrew B. Kahng et al.

Today's performance analysis frameworks for deep learning accelerators suffer from two significant limitations. First, although modern convolutional neural networks (CNNs) consist of many types of layers other than convolution, especially during training, these frameworks largely focus on convolution layers only. Second, these frameworks are generally targeted towards inference, and lack support for training operations. This work proposes a novel performance analysis framework, SimDIT, for general ASIC-based systolic hardware accelerator platforms. The modeling effort of SimDIT comprehensively covers convolution and non-convolution operations of both CNN inference and training on a highly parameterizable hardware substrate. SimDIT is integrated with a backend silicon implementation flow and provides detailed end-to-end performance statistics (i.e., data access cost, cycle counts, energy, and power) for executing CNN inference and training workloads. SimDIT-enabled performance analysis reveals that on a 64X64 processing array, non-convolution operations constitute 59.5% of total runtime for a ResNet-50 training workload. In addition, by optimally distributing available off-chip DRAM bandwidth and on-chip SRAM resources, SimDIT achieves an 18X performance improvement over a generic static resource allocation for ResNet-50 inference.

2.3 · AR · Dec 26, 2021

A Linear-Time Algorithm for Steady-State Analysis of Electromigration in General Interconnects

Mohammad Abdullah Al Shohel, Vidya A. Chhabria, Sachin S. Sapatnekar

Electromigration (EM) is a key reliability issue in deeply scaled technology nodes. Traditional EM methods first filter immortal wires using the Blech criterion, and then perform EM analysis based on Black's equation on the remaining wires. The Blech criterion is based on finding the steady-state stress in a two-terminal wire segment, but most on-chip structures are considerably more complex. Current-density-based assessment methodologies, i.e., Black's equation and the Blech criterion, which are predominantly used to detect EM-susceptible wires, do not capture the physics of EM, but alternative physics-based methods involve the solution of differential equations and are slow. This paper uses first principles, based on solving fundamental stress equations that relate electron wind and back-stress forces to the stress evolution in an interconnect, and devises a technique that analyzes any general tree or mesh interconnect structure to test for immortality. The resulting solution is extremely computationally efficient and its computation time is linear in the number of metal segments. Two variants of the method are proposed: a current-density-based method that requires traversals of the interconnect graph, and a voltage-based formulation that negates the need for any traversals. The methods are applied to large interconnect networks to determine the steady-state stress at all nodes and to test all segments of each network for immortality. The proposed model is applied to a variety of tree and mesh structures and is demonstrated to be fast. By construction, it is an exact solution and it is demonstrated to match much more computationally expensive numerical simulations.
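The voltage-based idea can be illustrated with a short Python sketch. It is a simplified reading under stated assumptions, not the paper's exact formulation: blocked boundaries, zero initial stress, uniform material, and a steady state in which the stress gradient balances the electron wind, so that node stress is affine in node voltage. The constant `k` (standing for e·Z*/Ω with Z* taken as a positive magnitude), the sign convention (tensile stress positive at the low-voltage cathode end), and the function names are hypothetical choices made for illustration.

```python
def steady_state_stress(node_voltage, segments, k=1.0):
    """Estimate steady-state EM stress at every node in one pass over the segments.

    node_voltage: dict mapping node name -> voltage
    segments:     list of (node_a, node_b, length, width, height)
    k:            assumed material constant e*Z*/Omega (hypothetical units)

    Assumes stress_i = k * (v_ref - V_i). Stress varies linearly along a
    segment, so mass conservation (volume-weighted stress sums to zero)
    fixes v_ref as the volume-weighted mean of segment midpoint voltages.
    """
    total_volume = 0.0
    weighted_voltage = 0.0
    for a, b, length, width, height in segments:
        volume = length * width * height
        total_volume += volume
        weighted_voltage += volume * 0.5 * (node_voltage[a] + node_voltage[b])
    v_ref = weighted_voltage / total_volume
    return {node: k * (v_ref - v) for node, v in node_voltage.items()}


def is_immortal(stress, sigma_crit):
    """A structure is treated as immortal if no node reaches the critical stress."""
    return max(stress.values()) < sigma_crit
```

For a single two-terminal wire with a voltage drop ΔV, this gives a peak tensile stress of k·ΔV/2 at the cathode, which has the same form as the classical Blech product bound. The cost is linear in the number of segments and needs no graph traversal, since only a weighted average of voltages is computed.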

6.3 · AR · Mar 15

Invited: Toward Accurate, Large-scale Electromigration Analysis and Optimization in Integrated Systems

Sachin S. Sapatnekar

Electromigration, a significant lifetime reliability concern in high-performance integrated circuits, is projected to grow even more important in future heterogeneously integrated systems that will service higher current loads. Today, EM checks are primarily based on rule-based methods, but these have known limitations. In recent years, there has been remarkable progress in enabling fast EM computations based on more accurate physics-based models, but such methods have not yet moved from research to practice. This paper overviews physics-based EM models, contrasts them with empirical models, and outlines several open problems that must be solved in order to enable accurate physics-based and circuit-aware EM analysis and optimization in future integrated systems.

1.4 · LG · Jan 16

Extractive summarization on a CMOS Ising machine

Ziqing Zeng, Abhimanyu Kumar, Ahmet Efe et al.

Extractive summarization (ES) aims to generate a concise summary by selecting a subset of sentences from a document while maximizing relevance and minimizing redundancy. Although modern ES systems achieve high accuracy using powerful neural models, their deployment typically relies on CPU or GPU infrastructures that are energy-intensive and poorly suited for real-time inference in resource-constrained environments. In this work, we explore the feasibility of implementing McDonald-style extractive summarization on a low-power CMOS coupled oscillator-based Ising machine (COBI) that supports integer-valued, all-to-all spin couplings. We first propose a hardware-aware Ising formulation that reduces the scale imbalance between local fields and coupling terms, thereby improving robustness to coefficient quantization: this method can be applied to any problem formulation that requires k of n variables to be chosen. We then develop a complete ES pipeline including (i) stochastic rounding and iterative refinement to compensate for precision loss, and (ii) a decomposition strategy that partitions a large ES problem into smaller Ising subproblems that can be efficiently solved on COBI and later combined. Experimental results on the CNN/DailyMail dataset show that our pipeline can produce high-quality summaries using only integer-coupled Ising hardware with limited precision. COBI achieves 3-4.5x runtime speedups compared to a brute-force method, which is comparable to software Tabu search, and two to three orders of magnitude reductions in energy, while maintaining competitive summary quality. These results highlight the potential of deploying CMOS Ising solvers for real-time, low-energy text summarization on edge devices.
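The "choose k of n" structure mentioned above can be illustrated with the standard quadratic-penalty encoding, sketched here in Python. This is the textbook form, not the hardware-aware formulation the paper proposes, and it exhibits the scale imbalance between linear and coupling terms that the paper sets out to reduce. The function names, the weights `lam` and `penalty`, and the toy scores are hypothetical.

```python
from itertools import product


def k_of_n_qubo(relevance, redundancy, k, lam=1.0, penalty=10.0):
    """QUBO for: maximize relevance, minimize pairwise redundancy, pick exactly k.

    E(x) = -sum_i rel_i x_i + lam * sum_{i<j} red_ij x_i x_j + penalty * (sum_i x_i - k)^2
    Using x_i^2 = x_i, the penalty expands into linear, pairwise, and constant parts.
    """
    n = len(relevance)
    linear = [-relevance[i] + penalty * (1 - 2 * k) for i in range(n)]
    quadratic = {
        (i, j): lam * redundancy[i][j] + 2 * penalty
        for i in range(n)
        for j in range(i + 1, n)
    }
    offset = penalty * k * k
    return linear, quadratic, offset


def best_selection(linear, quadratic, offset):
    """Exhaustive minimization (software stand-in for an Ising solver)."""
    best_energy, best_x = None, None
    for x in product((0, 1), repeat=len(linear)):
        energy = offset + sum(h * xi for h, xi in zip(linear, x))
        energy += sum(w * x[i] * x[j] for (i, j), w in quadratic.items())
        if best_energy is None or energy < best_energy:
            best_energy, best_x = energy, x
    return best_energy, best_x


# Toy data (made up): sentences 0 and 1 are relevant but nearly duplicates.
relevance = [0.9, 0.8, 0.7, 0.1]
redundancy = [
    [0.0, 0.9, 0.1, 0.1],
    [0.9, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]
energy, selection = best_selection(*k_of_n_qubo(relevance, redundancy, k=2))
```

With these toy scores the minimizer selects sentences 0 and 2, skipping the redundant pair. Note that the penalty terms are an order of magnitude larger than the relevance and redundancy terms, which is the kind of imbalance that makes coefficient quantization on integer-coupled hardware difficult.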

1.2 · AR · Oct 24, 2025

Accelerating Electrostatics-based Global Placement with Enhanced FFT Computation

Hangyu Zhang, Sachin S. Sapatnekar

Global placement is essential for high-quality and efficient circuit placement for complex modern VLSI designs. Recent advancements, such as electrostatics-based analytic placement, have improved scalability and solution quality. This work demonstrates that using an accelerated FFT technique, AccFFT, for electric field computation significantly reduces runtime. Experimental results on standard benchmarks show significant improvements when incorporated into the ePlace-MS and Pplace-MS algorithms, e.g., a 5.78x speedup in FFT computation and a 32% total runtime improvement against ePlace-MS, with 1.0% reduction of scaled half-perimeter wirelength after detailed placement.

2.3 · ET · Dec 21, 2023

Experimental demonstration of magnetic tunnel junction-based computational random-access memory

Yang Lv, Brandon R. Zink, Robert P. Bloom et al.

The conventional computing paradigm struggles to fulfill the rapidly growing demands from emerging applications, especially those for machine intelligence, because much of the power and energy is consumed by constant data transfers between logic and memory modules. A new paradigm, called "computational random-access memory (CRAM)," has emerged to address this fundamental limitation. CRAM performs logic operations directly using the memory cells themselves, without having the data ever leave the memory. The energy and performance benefits of CRAM for both conventional and emerging applications have been well established by prior numerical studies. However, an experimental demonstration and study of CRAM that evaluates its computation accuracy, a realistic and application-critical metric for its technological feasibility and competitiveness, has been lacking. In this work, a CRAM array based on magnetic tunnel junctions (MTJs) is experimentally demonstrated. First, basic memory operations as well as 2-, 3-, and 5-input logic operations are studied. Then, a 1-bit full adder with two different designs is demonstrated. Based on the experimental results, a suite of models has been developed to characterize the accuracy of CRAM computation. Scalar addition, multiplication, and matrix multiplication, which are essential building blocks for many conventional and machine intelligence applications, are evaluated and show promising accuracy performance. With the confirmation of MTJ-based CRAM's accuracy, there is a strong case that this technology will have a significant impact on power- and energy-demanding applications of machine intelligence.
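To show how a 1-bit full adder can be composed from 3- and 5-input logic operations, here is a small Python sketch built on majority gates. Whether either of the two demonstrated designs uses this exact construction is an assumption; the abstract does not name the gate types. The helper names `maj` and `full_adder` are hypothetical, and the code models ideal logic only, with no device-level errors.

```python
def maj(*bits):
    """Majority vote over an odd number of 0/1 inputs."""
    return int(sum(bits) > len(bits) // 2)


def full_adder(a, b, carry_in):
    """1-bit full adder from one 3-input and one 5-input majority gate.

    carry_out = MAJ3(a, b, carry_in)
    sum       = MAJ5(a, b, carry_in, NOT carry_out, NOT carry_out)
    """
    carry_out = maj(a, b, carry_in)
    not_carry = 1 - carry_out
    sum_bit = maj(a, b, carry_in, not_carry, not_carry)
    return sum_bit, carry_out


def add_bits(x_bits, y_bits):
    """Ripple-carry addition of two little-endian bit lists of equal length."""
    carry = 0
    out = []
    for a, b in zip(x_bits, y_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out + [carry]
```

Chaining `full_adder` as in `add_bits` yields multi-bit scalar addition, one of the building blocks the abstract evaluates. In real hardware each gate has a nonzero error probability, which is what the paper's accuracy modeling characterizes.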

2.3 · AR · Feb 12, 2024

IR-Aware ECO Timing Optimization Using Reinforcement Learning

Wenjing Jiang, Vidya A. Chhabria, Sachin S. Sapatnekar

Engineering change orders (ECOs) in late stages make minimal design fixes to recover from timing shifts due to excessive IR drops. This paper integrates IR-drop-aware timing analysis and ECO timing optimization using reinforcement learning (RL). The method operates after physical design and power grid synthesis, and rectifies IR-drop-induced timing degradation through gate sizing. It incorporates the Lagrangian relaxation (LR) technique into a novel RL framework, which trains a relational graph convolutional network (R-GCN) agent to sequentially size gates to fix timing violations. The R-GCN agent outperforms a classical LR-only algorithm: in an open 45nm technology, it (a) moves the Pareto front of the delay-power tradeoff curve to the left, (b) saves runtime over the prior approaches by running fast inference using trained models, and (c) reduces the perturbation to placement by sizing fewer cells. The RL model is transferable across timing specifications and to unseen designs with fine tuning.

3.8 · CR · Dec 10, 2021

Towards Homomorphic Inference Beyond the Edge

Salonik Resch, Zamshed I. Chowdhury, Husrev Cilasun et al.

Beyond-edge devices can function off the power grid and without batteries, enabling them to operate in difficult-to-access regions. However, the energy-costly long-distance communication required for reporting results or offloading computation becomes a limitation. Here, we reduce this overhead by developing a beyond-edge device which can effectively act as a nearby server to offload computation. For security reasons, this device must operate on encrypted data, which incurs a high overhead. We use energy-efficient and intermittent-safe in-memory computation to enable this encrypted computation, allowing it to provide a speedup for beyond-edge applications within a power budget of a few milliwatts.

4.3 · AR · Oct 27, 2021

Encoder-Decoder Networks for Analyzing Thermal and Power Delivery Networks

Vidya A. Chhabria, Vipul Ahuja, Ashwath Prabhu et al.

Power delivery network (PDN) analysis and thermal analysis are computationally expensive tasks that are essential for successful IC design. Algorithmically, both these analyses have similar computational structure and complexity as they involve the solution to a partial differential equation of the same form. This paper converts these analyses into image-to-image and sequence-to-sequence translation tasks, which allows leveraging a class of machine learning models with an encoder-decoder-based generative (EDGe) architecture to address the time-intensive nature of these tasks. For PDN analysis, we propose two networks: (i) IREDGe: a full-chip static and dynamic IR drop predictor and (ii) EMEDGe: an electromigration (EM) hotspot classifier based on input power, power grid distribution, and power pad distribution patterns. For thermal analysis, we propose ThermEDGe, a full-chip static and dynamic temperature estimator based on input power distribution patterns. These networks are transferable across designs synthesized within the same technology and packaging solution. The networks predict on-chip IR drop, EM hotspot locations, and temperature in milliseconds with negligibly small errors against commercial tools requiring several hours.

3.3 · AR · Oct 27, 2021

OpeNPDN: A Neural-network-based Framework for Power Delivery Network Synthesis

Vidya A. Chhabria, Sachin S. Sapatnekar

Power delivery network (PDN) design is a nontrivial, time-intensive, and iterative task. Correct PDN design must account for considerations related to power bumps, currents, blockages, and signal congestion distribution patterns. This work proposes a machine learning-based methodology that employs a set of predefined PDN templates. At the floorplan stage, coarse estimates of current, congestion, macro/blockages, and C4 bump distributions are used to synthesize a grid for early design. At the placement stage, the grid is incrementally refined based on more accurate and fine-grained distributions of current and congestion. At each stage, a convolutional neural network (CNN) selects an appropriate PDN template for each region on the chip, building a safe-by-construction PDN that meets IR drop and electromigration (EM) specifications. The CNN is initially trained using a large synthetically-created dataset, following which transfer learning is leveraged to bridge the gap between real-circuit data (with a limited dataset size) and synthetically-generated data. On average, the optimization of the PDN frees thousands of routing tracks in congestion-critical regions, when compared to a globally uniform PDN, while staying within the IR drop and EM limits.

7.9 · LG · Sep 30, 2020

A general approach for identifying hierarchical symmetry constraints for analog circuit layout

Kishor Kunal, Jitesh Poojary, Tonmoy Dhar et al.

Analog layout synthesis requires some elements in the circuit netlist to be matched and placed symmetrically. However, the set of symmetries is very circuit-specific and a versatile algorithm, applicable to a broad variety of circuits, has been elusive. This paper presents a general methodology for the automated generation of symmetry constraints, and applies these constraints to guide automated layout synthesis. While prior approaches were restricted to identifying simple symmetries, the proposed method operates hierarchically and uses graph-based algorithms to extract multiple axes of symmetry within a circuit. An important ingredient of the algorithm is its ability to identify arrays of repeated structures. In some circuits, the repeated structures are not perfect replicas and can only be found through approximate graph matching. A fast graph neural network based methodology is developed for this purpose, based on evaluating the graph edit distance. The utility of this algorithm is demonstrated on a variety of circuits, including operational amplifiers, data converters, equalizers, and low-noise amplifiers.

10.3 · AR · Sep 18, 2020 · Code

Thermal and IR Drop Analysis Using Convolutional Encoder-Decoder Networks

Vidya A. Chhabria, Vipul Ahuja, Ashwath Prabhu et al.

Computationally expensive temperature and power grid analyses are required during the design cycle to guide IC design. This paper employs encoder-decoder based generative (EDGe) networks to map these analyses to fast and accurate image-to-image and sequence-to-sequence translation tasks. The network takes a power map as input and outputs the corresponding temperature or IR drop map. We propose two networks: (i) ThermEDGe: a static and dynamic full-chip temperature estimator and (ii) IREDGe: a full-chip static IR drop predictor based on input power, power grid distribution, and power pad distribution patterns. The models are design-independent and must be trained just once for a particular technology and packaging solution. ThermEDGe and IREDGe are demonstrated to rapidly predict the on-chip temperature and IR drop contours in milliseconds (in contrast with commercial tools that require several hours or more) and provide an average error of 0.6% and 0.008% respectively.

1.2 · NA · Sep 24, 2006

Stochastic Preconditioning for Iterative Linear Equation Solvers

Haifeng Qian, Sachin S. Sapatnekar

This paper presents a new stochastic preconditioning approach. For symmetric diagonally-dominant M-matrices, we prove that an incomplete LDL factorization can be obtained from random walks, and used as a preconditioner for an iterative solver, e.g., conjugate gradient. It is argued that our factor matrices have better quality, i.e., better accuracy-size tradeoffs, than preconditioners produced by existing incomplete factorization methods. Therefore the resulting preconditioned conjugate gradient (PCG) method requires less computation than traditional PCG methods to solve a set of linear equations with the same error tolerance, and the advantage increases for larger and denser sets of linear equations. These claims are verified by numerical tests, and we provide techniques that can potentially extend the theory to more general types of matrices.
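For context on where a preconditioner enters the iteration, here is a minimal preconditioned conjugate gradient sketch in pure Python. The paper's preconditioner is an incomplete LDL factorization obtained from random walks; that construction is not reproduced here. A simple Jacobi (diagonal) preconditioner is swapped in as a placeholder, and all function names and the example matrix are hypothetical.

```python
def matvec(A, x):
    """Dense matrix-vector product for a list-of-lists matrix."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def jacobi_preconditioner(A):
    """Placeholder preconditioner: divide the residual by the diagonal of A."""
    diag = [A[i][i] for i in range(len(A))]
    return lambda r: [ri / di for ri, di in zip(r, diag)]


def pcg(A, b, apply_preconditioner, tol=1e-10, max_iter=1000):
    """Preconditioned conjugate gradient for symmetric positive definite A.

    apply_preconditioner(r) should return an approximation of A^{-1} r.
    Returns (solution, iterations_used).
    """
    x = [0.0] * len(b)
    r = list(b)                      # residual b - A x for the zero start
    z = apply_preconditioner(r)
    p = list(z)
    rz = dot(r, z)
    for iteration in range(max_iter):
        if dot(r, r) ** 0.5 <= tol:
            return x, iteration
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = apply_preconditioner(r)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


# Small symmetric diagonally dominant M-matrix (made-up example).
A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
b = [1.0, 2.0, 3.0]
solution, iterations = pcg(A, b, jacobi_preconditioner(A))
```

Any better approximation of the inverse, such as the factor matrices the paper describes, would plug in through the same `apply_preconditioner` argument; a higher-quality preconditioner typically lowers the iteration count at the cost of more work per application.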