ET Sep 20, 2023
3SAT on an All-to-All-Connected CMOS Ising Solver Chip
Hüsrev Cılasun, Ziqing Zeng, Ramprasath S et al.
This work solves 3SAT, a classical NP-complete problem, on a CMOS-based Ising hardware chip with all-to-all connectivity. The paper addresses practical issues in going from algorithms to hardware. It considers several degrees of freedom in mapping the 3SAT problem to the chip - using multiple Ising formulations for 3SAT; exploring multiple strategies for decomposing large problems into subproblems that can be accommodated on the Ising chip; and executing a sequence of these subproblems on CMOS hardware to obtain the solution to the larger problem. These are evaluated within a software framework, and the results are used to identify the most promising formulations and decomposition techniques. These best approaches are then mapped to the all-to-all hardware, and the performance of 3SAT is evaluated on the chip. Experimental data shows that the deployed decomposition and mapping strategies impact SAT solution quality: without our methods, the CMOS hardware cannot achieve 3SAT solutions on SATLIB benchmarks.
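The abstract does not say which Ising formulations the paper compares. As an illustration only, one textbook route from 3SAT to a quadratic binary objective goes through maximum independent set: create one node per literal occurrence, connect literals within the same clause and literals that contradict each other, and minimize a QUBO whose minimum equals minus the size of the largest independent set. The formula is satisfiable exactly when that size equals the number of clauses. The function names and the brute-force search below are hypothetical stand-ins for the hardware solver, and the sketch is practical only for very small instances.

```python
from itertools import product

def sat_to_mis_graph(clauses):
    """Build the conflict graph: one node per literal occurrence."""
    nodes = [(ci, lit) for ci, clause in enumerate(clauses) for lit in clause]
    edges = []
    for a in range(len(nodes)):
        for b in range(a + 1, len(nodes)):
            (ca, la), (cb, lb) = nodes[a], nodes[b]
            # Same clause: choose at most one literal per clause.
            # Opposite literals: they cannot both be set to true.
            if ca == cb or la == -lb:
                edges.append((a, b))
    return nodes, edges

def qubo_energy(x, edges, penalty=2):
    """E(x) = -sum_i x_i + penalty * sum_{(i,j) in edges} x_i * x_j."""
    return -sum(x) + penalty * sum(x[a] * x[b] for a, b in edges)

def solve_by_brute_force(clauses):
    """Exhaustively minimize the QUBO (a stand-in for an Ising solver)."""
    nodes, edges = sat_to_mis_graph(clauses)
    best = min(product((0, 1), repeat=len(nodes)),
               key=lambda x: qubo_energy(x, edges))
    # Satisfiable iff one literal can be selected from every clause.
    if qubo_energy(best, edges) != -len(clauses):
        return False, None
    # Selected literals fix their variables; unlisted variables are free.
    assignment = {abs(lit): lit > 0
                  for (_, lit), bit in zip(nodes, best) if bit}
    return True, assignment

# Literals are signed integers: 3 means x3, -3 means NOT x3.
example = [(1, 2, 3), (-1, -2, 3), (1, -2, -3)]
satisfiable, assignment = solve_by_brute_force(example)
```

Converting the binary variables to spins with x = (1 + s) / 2 turns this QUBO into an Ising Hamiltonian with local fields and pairwise couplings. Each 3-literal clause costs three spins here, which is one reason problem size and decomposition matter on a fixed-size chip.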
AR Jun 29, 2023
Performance Analysis of DNN Inference/Training with Convolution and non-Convolution Operations
Hadi Esmaeilzadeh, Soroush Ghodrati, Andrew B. Kahng et al.
Today's performance analysis frameworks for deep learning accelerators suffer from two significant limitations. First, although modern convolutional neural networks (CNNs) consist of many types of layers other than convolution, especially during training, these frameworks largely focus on convolution layers only. Second, these frameworks are generally targeted towards inference, and lack support for training operations. This work proposes a novel performance analysis framework, SimDIT, for general ASIC-based systolic hardware accelerator platforms. The modeling effort of SimDIT comprehensively covers convolution and non-convolution operations of both CNN inference and training on a highly parameterizable hardware substrate. SimDIT is integrated with a backend silicon implementation flow and provides detailed end-to-end performance statistics (i.e., data access cost, cycle counts, energy, and power) for executing CNN inference and training workloads. SimDIT-enabled performance analysis reveals that on a 64x64 processing array, non-convolution operations constitute 59.5% of total runtime for a ResNet-50 training workload. In addition, by optimally distributing available off-chip DRAM bandwidth and on-chip SRAM resources, SimDIT achieves an 18x performance improvement over a generic static resource allocation for ResNet-50 inference.
AR Dec 26, 2021
A Linear-Time Algorithm for Steady-State Analysis of Electromigration in General Interconnects
Mohammad Abdullah Al Shohel, Vidya A. Chhabria, Sachin S. Sapatnekar
Electromigration (EM) is a key reliability issue in deeply scaled technology nodes. Traditional EM methods first filter immortal wires using the Blech criterion, and then perform EM analysis based on Black's equation on the remaining wires. The Blech criterion is based on finding the steady-state stress in a two-terminal wire segment, but most on-chip structures are considerably more complex. Current-density-based assessment methodologies, i.e., Black's equation and the Blech criterion, which are predominantly used to detect EM-susceptible wires, do not capture the physics of EM, but alternative physics-based methods involve the solution of differential equations and are slow. This paper uses first principles, based on solving fundamental stress equations that relate electron wind and back-stress forces to the stress evolution in an interconnect, and devises a technique that analyzes any general tree or mesh interconnect structure to test for immortality. The resulting solution is extremely computationally efficient and its computation time is linear in the number of metal segments. Two variants of the method are proposed: a current-density-based method that requires traversals of the interconnect graph, and a voltage-based formulation that negates the need for any traversals. The methods are applied to large interconnect networks to determine the steady-state stress at all nodes and to test all segments of each network for immortality. The proposed model is applied to a variety of tree and mesh structures and is demonstrated to be fast. By construction, it is an exact solution and it is demonstrated to match much more computationally expensive numerical simulations.
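For context, the traditional Blech filter that the abstract contrasts with can be sketched in a few lines: a two-terminal segment is treated as immortal when the product of its current density and its length stays at or below a critical value. This is the classical per-segment test, not the paper's linear-time method for general trees and meshes. The function name, the segment data, and the threshold value are hypothetical and chosen only for illustration.

```python
def blech_immortal(current_density, length, jl_crit):
    """Classical Blech test for one two-terminal segment.

    current_density: A/m^2 (sign gives direction; only magnitude matters)
    length:          m
    jl_crit:         critical j*L product in A/m (technology dependent)
    """
    return abs(current_density) * length <= jl_crit

# Hypothetical segments: (name, current density in A/m^2, length in m).
segments = [
    ("seg_a", 1.0e10, 20e-6),
    ("seg_b", -2.0e10, 100e-6),
]
JL_CRIT = 1.0e6  # A/m, illustrative value only

# Map each segment name to whether it passes the filter.
verdicts = {name: blech_immortal(j, length, JL_CRIT)
            for name, j, length in segments}
```

Because the test looks at each segment in isolation, it ignores how stress couples across the neighboring segments of a tree or mesh, which is the gap the abstract describes.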
3.6 AR Mar 15
Invited: Toward Accurate, Large-scale Electromigration Analysis and Optimization in Integrated Systems
Sachin S. Sapatnekar
Electromigration, a significant lifetime reliability concern in high-performance integrated circuits, is projected to grow even more important in future heterogeneously integrated systems that will service higher current loads. Today, EM checks are primarily based on rule-based methods, but these have known limitations. In recent years, there has been remarkable progress in enabling fast EM computations based on more accurate physics-based models, but such methods have not yet moved from research to practice. This paper overviews physics-based EM models, contrasts them with empirical models, and outlines several open problems that must be solved in order to enable accurate physics-based and circuit-aware EM analysis and optimization in future integrated systems.
14.5 AR Mar 20 Code
COmPOSER: Circuit Optimization of mm-wave/RF circuits with Performance-Oriented Synthesis for Efficient Realizations
Subhadip Ghosh, Surya Srikar Peri, Ramprasath S. et al.
This work presents COmPOSER, an open-source, end-to-end framework for RF/mm-wave design automation that translates target specifications into optimized circuits with layouts. It unifies schematic synthesis, layout generation for actives and passives, and placement/routing, incorporating physics-based equations and machine-learning-driven electromagnetic models. Based on post-layout validation on multiple LNAs and PAs operating at up to 60GHz in a commercial 65nm process-kit, COmPOSER meets performance targets, comparable to expert manual designs, while delivering a 100-300x productivity gain.
60.0 ET Apr 8
Computing In Spintronic Memory: A Thermal Perspective
Patrick Miller, Hüsrev Cilasun, Sachin S. Sapatnekar et al.
Computing-in-Memory (CiM) is a promising paradigm to address the memory bottleneck constraining traditional systems. Most power-efficient CiM variants can directly perform Boolean operations in non-volatile memory arrays. Higher microarchitectural activity due to CiM, however, can significantly increase power density (power per area) and result in thermal hotspots. In this paper, we provide a quantitative thermal characterization for CiM. We demonstrate that (i) the temperature remains mostly uniform due to lateral thermal conduction; (ii) the temperature increases linearly with the number of memory cells participating in computation; (iii) the temperature decreases linearly with the memory array size; (iv) the memory technology dictates the power density, hence the thermal characteristics.
LG Jan 16
Extractive summarization on a CMOS Ising machine
Ziqing Zeng, Abhimanyu Kumar, Ahmet Efe et al.
Extractive summarization (ES) aims to generate a concise summary by selecting a subset of sentences from a document while maximizing relevance and minimizing redundancy. Although modern ES systems achieve high accuracy using powerful neural models, their deployment typically relies on CPU or GPU infrastructures that are energy-intensive and poorly suited for real-time inference in resource-constrained environments. In this work, we explore the feasibility of implementing McDonald-style extractive summarization on a low-power CMOS coupled oscillator-based Ising machine (COBI) that supports integer-valued, all-to-all spin couplings. We first propose a hardware-aware Ising formulation that reduces the scale imbalance between local fields and coupling terms, thereby improving robustness to coefficient quantization: this method can be applied to any problem formulation that requires k of n variables to be chosen. We then develop a complete ES pipeline including (i) stochastic rounding and iterative refinement to compensate for precision loss, and (ii) a decomposition strategy that partitions a large ES problem into smaller Ising subproblems that can be efficiently solved on COBI and later combined. Experimental results on the CNN/DailyMail dataset show that our pipeline can produce high-quality summaries using only integer-coupled Ising hardware with limited precision. COBI achieves 3-4.5x runtime speedups compared to a brute-force method, which is comparable to software Tabu search, and two to three orders of magnitude reductions in energy, while maintaining competitive summary quality. These results highlight the potential of deploying CMOS Ising solvers for real-time, low-energy text summarization on edge devices.
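To make the "k of n variables" constraint concrete, here is a minimal sketch of the standard quadratic penalty (sum_i x_i - k)^2 rewritten over spins s_i in {-1, +1} with x_i = (1 + s_i) / 2. This is the plain textbook encoding, not the hardware-aware formulation the paper proposes, and the function names are hypothetical. Expanding the square gives a uniform local field h_i = (n - 2k) / 2, a uniform coupling J_ij = 1/2 for every pair, and a constant offset.

```python
from itertools import product

def k_of_n_ising(n, k):
    """Ising coefficients of the penalty (sum_i x_i - k)^2, x_i = (1 + s_i) / 2."""
    h = [(n - 2 * k) / 2] * n                      # same local field on every spin
    J = {(i, j): 0.5                               # all-to-all pairwise couplings
         for i in range(n) for j in range(i + 1, n)}
    const = ((n - 2 * k) ** 2 + n) / 4             # constant energy offset
    return h, J, const

def ising_energy(s, h, J, const):
    """E(s) = const + sum_i h_i s_i + sum_{i<j} J_ij s_i s_j."""
    return (const
            + sum(hi * si for hi, si in zip(h, s))
            + sum(w * s[i] * s[j] for (i, j), w in J.items()))

# Check on a small case that the energy vanishes exactly when k spins are up.
n, k = 5, 2
h, J, const = k_of_n_ising(n, k)
ground_states = [s for s in product((-1, 1), repeat=n)
                 if ising_energy(s, h, J, const) == 0]
```

In this encoding the field magnitude is |n - 2k| times the coupling magnitude, so for k much smaller than n the fields dominate. That may be the kind of scale imbalance the abstract refers to, since integer quantization would then leave little resolution for the couplings.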
AR Oct 24, 2025
Accelerating Electrostatics-based Global Placement with Enhanced FFT Computation
Hangyu Zhang, Sachin S. Sapatnekar
Global placement is essential for high-quality and efficient circuit placement for complex modern VLSI designs. Recent advancements, such as electrostatics-based analytic placement, have improved scalability and solution quality. This work demonstrates that using an accelerated FFT technique, AccFFT, for electric field computation significantly reduces runtime. Experimental results on standard benchmarks show significant improvements when incorporated into the ePlace-MS and Pplace-MS algorithms, e.g., a 5.78x speedup in FFT computation and a 32% total runtime improvement against ePlace-MS, with 1.0% reduction of scaled half-perimeter wirelength after detailed placement.
ET Dec 21, 2023
Experimental demonstration of magnetic tunnel junction-based computational random-access memory
Yang Lv, Brandon R. Zink, Robert P. Bloom et al.
The conventional computing paradigm struggles to fulfill the rapidly growing demands from emerging applications, especially those for machine intelligence, because much of the power and energy is consumed by constant data transfers between logic and memory modules. A new paradigm, called "computational random-access memory (CRAM)", has emerged to address this fundamental limitation. CRAM performs logic operations directly using the memory cells themselves, without the data ever leaving the memory. The energy and performance benefits of CRAM for both conventional and emerging applications have been well established by prior numerical studies. However, an experimental demonstration and study of CRAM that evaluates its computation accuracy, a realistic and application-critical metric for its technological feasibility and competitiveness, has been lacking. In this work, a CRAM array based on magnetic tunnel junctions (MTJs) is experimentally demonstrated. First, basic memory operations as well as 2-, 3-, and 5-input logic operations are studied. Then, a 1-bit full adder with two different designs is demonstrated. Based on the experimental results, a suite of models has been developed to characterize the accuracy of CRAM computation. Scalar addition, multiplication, and matrix multiplication, which are essential building blocks for many conventional and machine intelligence applications, are evaluated and show promising accuracy. With the confirmation of MTJ-based CRAM's accuracy, there is a strong case that this technology will have a significant impact on power- and energy-demanding applications of machine intelligence.
AR Feb 12, 2024
IR-Aware ECO Timing Optimization Using Reinforcement Learning
Wenjing Jiang, Vidya A. Chhabria, Sachin S. Sapatnekar
Engineering change orders (ECOs) in late stages make minimal design fixes to recover from timing shifts due to excessive IR drops. This paper integrates IR-drop-aware timing analysis and ECO timing optimization using reinforcement learning (RL). The method operates after physical design and power grid synthesis, and rectifies IR-drop-induced timing degradation through gate sizing. It incorporates the Lagrangian relaxation (LR) technique into a novel RL framework, which trains a relational graph convolutional network (R-GCN) agent to sequentially size gates to fix timing violations. The R-GCN agent outperforms a classical LR-only algorithm: in an open 45nm technology, it (a) moves the Pareto front of the delay-power tradeoff curve to the left, (b) saves runtime over the prior approaches by running fast inference using trained models, and (c) reduces the perturbation to placement by sizing fewer cells. The RL model is transferable across timing specifications and to unseen designs with fine tuning.
AR May 11, 2023
A Machine Learning Approach to Improving Timing Consistency between Global Route and Detailed Route
Vidya A. Chhabria, Wenjing Jiang, Andrew B. Kahng et al.
Due to the unavailability of routing information in design stages prior to detailed routing (DR), the tasks of timing prediction and optimization pose major challenges. Inaccurate timing prediction wastes design effort, hurts circuit performance, and may lead to design failure. This work focuses on timing prediction after clock tree synthesis and placement legalization, which is the earliest opportunity to time and optimize a "complete" netlist. The paper first documents that having "oracle knowledge" of the final post-DR parasitics enables post-global routing (GR) optimization to produce improved final timing outcomes. To bridge the gap between GR-based parasitic and timing estimation and post-DR results during post-GR optimization, machine learning (ML)-based models are proposed, including the use of features for macro blockages for accurate predictions for designs with macros. Based on a set of experimental evaluations, it is demonstrated that these models show higher accuracy than GR-based timing estimation. When used during post-GR optimization, the ML-based models show demonstrable improvements in post-DR circuit performance. The methodology is applied to two different tool flows - OpenROAD and a commercial tool flow - and results on 45nm bulk and 12nm FinFET enablements show improvements in post-DR slack metrics without increasing congestion. The models are demonstrated to be generalizable to designs generated under different clock period constraints and are robust to training data with small levels of noise.
AR Oct 27, 2021
Encoder-Decoder Networks for Analyzing Thermal and Power Delivery Networks
Vidya A. Chhabria, Vipul Ahuja, Ashwath Prabhu et al.
Power delivery network (PDN) analysis and thermal analysis are computationally expensive tasks that are essential for successful IC design. Algorithmically, both these analyses have similar computational structure and complexity as they involve the solution to a partial differential equation of the same form. This paper converts these analyses into image-to-image and sequence-to-sequence translation tasks, which allows leveraging a class of machine learning models with an encoder-decoder-based generative (EDGe) architecture to address the time-intensive nature of these tasks. For PDN analysis, we propose two networks: (i) IREDGe: a full-chip static and dynamic IR drop predictor and (ii) EMEDGe: an electromigration (EM) hotspot classifier based on input power, power grid distribution, and power pad distribution patterns. For thermal analysis, we propose ThermEDGe, a full-chip static and dynamic temperature estimator based on input power distribution patterns. These networks are transferable across designs synthesized within the same technology and packaging solution. The networks predict on-chip IR drop, EM hotspot locations, and temperature in milliseconds with negligibly small errors against commercial tools requiring several hours.
AR Oct 27, 2021
OpeNPDN: A Neural-network-based Framework for Power Delivery Network Synthesis
Vidya A. Chhabria, Sachin S. Sapatnekar
Power delivery network (PDN) design is a nontrivial, time-intensive, and iterative task. Correct PDN design must account for considerations related to power bumps, currents, blockages, and signal congestion distribution patterns. This work proposes a machine learning-based methodology that employs a set of predefined PDN templates. At the floorplan stage, coarse estimates of current, congestion, macro/blockages, and C4 bump distributions are used to synthesize a grid for early design. At the placement stage, the grid is incrementally refined based on more accurate and fine-grained distributions of current and congestion. At each stage, a convolutional neural network (CNN) selects an appropriate PDN template for each region on the chip, building a safe-by-construction PDN that meets IR drop and electromigration (EM) specifications. The CNN is initially trained using a large synthetically-created dataset, following which transfer learning is leveraged to bridge the gap between real-circuit data (with a limited dataset size) and synthetically-generated data. On average, the optimization of the PDN frees thousands of routing tracks in congestion-critical regions, when compared to a globally uniform PDN, while staying within the IR drop and EM limits.
AR May 21, 2021
GNNIE: GNN Inference Engine with Load-balancing and Graph-Specific Caching
Sudipta Mondal, Susmita Dey Manasi, Kishor Kunal et al.
Graph neural network (GNN) analysis engines are vital for real-world problems that use large graph models. Challenges for a GNN hardware platform include the ability to (a) host a variety of GNNs, (b) handle high sparsity in input vertex feature vectors and the graph adjacency matrix and the accompanying random memory access patterns, and (c) maintain load-balanced computation in the face of uneven workloads, induced by high sparsity and power-law vertex degree distributions. This paper proposes GNNIE, an accelerator designed to run a broad range of GNNs. It tackles workload imbalance by (i) splitting vertex feature operands into blocks, (ii) reordering and redistributing computations, and (iii) using a novel flexible MAC architecture. It adopts a graph-specific, degree-aware caching policy that is well suited to real-world graph characteristics. The policy enhances on-chip data reuse and avoids random memory access to DRAM. GNNIE achieves average speedups of 21233x over a CPU and 699x over a GPU over multiple datasets on graph attention networks (GATs), graph convolutional networks (GCNs), GraphSAGE, GINConv, and DiffPool. Compared to prior approaches, GNNIE achieves an average speedup of 35x over HyGCN (which cannot implement GATs) for GCN, GraphSAGE, and GINConv, and, using 3.4x fewer processing units, an average speedup of 2.1x over AWB-GCN (which runs only GCNs).
LG Sep 30, 2020
A general approach for identifying hierarchical symmetry constraints for analog circuit layout
Kishor Kunal, Jitesh Poojary, Tonmoy Dhar et al.
Analog layout synthesis requires some elements in the circuit netlist to be matched and placed symmetrically. However, the set of symmetries is very circuit-specific and a versatile algorithm, applicable to a broad variety of circuits, has been elusive. This paper presents a general methodology for the automated generation of symmetry constraints, and applies these constraints to guide automated layout synthesis. While prior approaches were restricted to identifying simple symmetries, the proposed method operates hierarchically and uses graph-based algorithms to extract multiple axes of symmetry within a circuit. An important ingredient of the algorithm is its ability to identify arrays of repeated structures. In some circuits, the repeated structures are not perfect replicas and can only be found through approximate graph matching. A fast graph neural network based methodology is developed for this purpose, based on evaluating the graph edit distance. The utility of this algorithm is demonstrated on a variety of circuits, including operational amplifiers, data converters, equalizers, and low-noise amplifiers.
AR Sep 18, 2020
Thermal and IR Drop Analysis Using Convolutional Encoder-Decoder Networks
Vidya A. Chhabria, Vipul Ahuja, Ashwath Prabhu et al.
Computationally expensive temperature and power grid analyses are required during the design cycle to guide IC design. This paper employs encoder-decoder based generative (EDGe) networks to map these analyses to fast and accurate image-to-image and sequence-to-sequence translation tasks. The network takes a power map as input and outputs the corresponding temperature or IR drop map. We propose two networks: (i) ThermEDGe: a static and dynamic full-chip temperature estimator and (ii) IREDGe: a full-chip static IR drop predictor based on input power, power grid distribution, and power pad distribution patterns. The models are design-independent and must be trained just once for a particular technology and packaging solution. ThermEDGe and IREDGe are demonstrated to rapidly predict the on-chip temperature and IR drop contours in milliseconds (in contrast with commercial tools that require several hours or more) and provide an average error of 0.6% and 0.008% respectively.
NA Sep 24, 2006
Stochastic Preconditioning for Iterative Linear Equation Solvers
Haifeng Qian, Sachin S. Sapatnekar
This paper presents a new stochastic preconditioning approach. For symmetric diagonally-dominant M-matrices, we prove that an incomplete LDL factorization can be obtained from random walks, and used as a preconditioner for an iterative solver, e.g., conjugate gradient. It is argued that our factor matrices have better quality, i.e., better accuracy-size tradeoffs, than preconditioners produced by existing incomplete factorization methods. Therefore the resulting preconditioned conjugate gradient (PCG) method requires less computation than traditional PCG methods to solve a set of linear equations with the same error tolerance, and the advantage increases for larger and denser sets of linear equations. These claims are verified by numerical tests, and we provide techniques that can potentially extend the theory to more general types of matrices.
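To show where a preconditioner enters the iteration, the following is a minimal pure-Python sketch of preconditioned conjugate gradient. It uses a simple Jacobi (diagonal) preconditioner as a stand-in: the random-walk-based incomplete LDL factorization described in the abstract is not reproduced here. The function names and the small test matrix, a symmetric diagonally dominant M-matrix, are hypothetical.

```python
def pcg(A, b, apply_preconditioner, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A (dense list of lists).

    apply_preconditioner(r) should return an approximation of A^{-1} r.
    Returns the solution estimate and the number of iterations used.
    """
    n = len(b)

    def matvec(v):
        return [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    x = [0.0] * n
    r = list(b)                      # residual b - A x for the initial x = 0
    z = apply_preconditioner(r)      # preconditioned residual
    p = list(z)                      # first search direction
    rz = dot(r, z)
    for iteration in range(max_iter):
        if dot(r, r) ** 0.5 <= tol:  # stop once the residual norm is small
            return x, iteration
        Ap = matvec(p)
        alpha = rz / dot(p, Ap)      # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = apply_preconditioner(r)
        rz_new = dot(r, z)
        beta = rz_new / rz           # keeps new direction A-conjugate to old ones
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter

# Hypothetical test system.
A = [[4.0, -1.0, 0.0],
     [-1.0, 4.0, -1.0],
     [0.0, -1.0, 3.0]]
b = [1.0, 2.0, 3.0]

def jacobi(r):
    """Jacobi preconditioner: divide each residual entry by the diagonal of A."""
    return [ri / A[i][i] for i, ri in enumerate(r)]

x, iterations = pcg(A, b, jacobi)
```

A better preconditioner changes only `apply_preconditioner`; the rest of the loop is unchanged. With an incomplete factorization, that call would apply the triangular and diagonal solves of the factors instead of a diagonal scaling.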