SEMay 18
DEFT: Differentiable Automatic Test Pattern GenerationWei Li, Yang Zou, Yixin Liang et al.
Modern IC complexity drives test pattern growth, with the majority of patterns targeting a small set of hard-to-detect (HTD) faults. This motivates new ATPG algorithms to improve test effectiveness specifically for HTD faults. This paper presents DEFT (Differentiable Automatic Test Pattern Generation), a new ATPG approach that reformulates the discrete ATPG problem as a continuous optimization task. DEFT introduces a mathematically grounded reparameterization that aligns the expected continuous objective with discrete fault-detection semantics, enabling reliable gradient-based pattern generation. To ensure scalability and stability on deep circuit graphs, DEFT integrates a custom CUDA kernel for efficient forward-backward propagation and applies gradient normalization to mitigate vanishing gradients. Compared to a leading commercial tool on a wide range of benchmarks, DEFT reduced the pattern count by 27.3% on average and by up to 75.9%. DEFT also supports practical ATPG settings such as partial assignment pattern generation, producing patterns with 19.3% fewer 0/1 bits while still detecting 35% more faults. These results indicate DEFT is a promising and effective ATPG engine, offering a valuable complement to existing heuristics.
ETMar 18
Probabilistic approximate optimization using single-photon avalanche diode arraysZiyad Alswaidan, Abdelrahman S. Abdelrahman, Md Sakibur Sajal et al.
Combinatorial optimization problems are central to science and engineering and specialized hardware from quantum annealers to classical Ising machines are being actively developed to address them. These systems typically sample from a fixed energy landscape defined by the problem Hamiltonian encoding the discrete optimization problem. The recently introduced Probabilistic Approximate Optimization Algorithm (PAOA) takes a different approach: it treats the optimization landscape itself as variational, iteratively learning circuit parameters from samples. Here, we demonstrate PAOA on a 64$\times$64 perimeter-gated single-photon avalanche diode (pgSPAD) array fabricated in 0.35 $μ$m CMOS, the first realization of the algorithm using intrinsically stochastic nanodevices. Each p-bit exhibits a device-specific, asymmetric (Gompertz-type) activation function due to dark-count variability. Rather than calibrating devices to enforce a uniform symmetric (logistic/tanh) activation, PAOA learns around device variations, absorbing residual activation and other mismatches into the variational parameters. On canonical 26-spin Sherrington-Kirkpatrick instances, PAOA achieves high approximation ratios with $2p$ parameters ($p$ up to 17 layers), and pgSPAD-based inference closely tracks CPU simulations. These results show that variational learning can accommodate the non-idealities inherent to nanoscale devices, suggesting a practical path toward larger-scale, CMOS-compatible probabilistic computers.
ARDec 25, 2024Code
Tempus Core: Area-Power Efficient Temporal-Unary Convolution Core for Low-Precision Edge DLAsPrabhu Vellaisamy, Harideep Nair, Thomas Kang et al.
The increasing complexity of deep neural networks (DNNs) poses significant challenges for edge inference deployment due to resource and power constraints of edge devices. Recent works on unary-based matrix multiplication hardware aim to leverage data sparsity and low-precision values to enhance hardware efficiency. However, the adoption and integration of such unary hardware into commercial deep learning accelerators (DLA) remain limited due to processing element (PE) array dataflow differences. This work presents Tempus Core, a convolution core with highly scalable unary-based PE array comprising of tub (temporal-unary-binary) multipliers that seamlessly integrates with the NVDLA (NVIDIA's open-source DLA for accelerating CNNs) while maintaining dataflow compliance and boosting hardware efficiency. Analysis across various datapath granularities shows that for INT8 precision in 45nm CMOS, Tempus Core's PE cell unit (PCU) yields 59.3% and 15.3% reductions in area and power consumption, respectively, over NVDLA's CMAC unit. Considering a 16x16 PE array in Tempus Core, area and power improves by 75% and 62%, respectively, while delivering 5x and 4x iso-area throughput improvements for INT8 and INT4 precisions. Post-place and route analysis of Tempus Core's PCU shows that the 16x4 PE array for INT4 precision in 45nm CMOS requires only 0.017 mm^2 die area and consumes only 6.2mW of total power. We demonstrate that area-power efficient unary-based hardware can be seamlessly integrated into conventional DLAs, paving the path for efficient unary hardware for edge AI inference.
DCMar 12
TaxBreak: Unmasking the Hidden Costs of LLM Inference Through Overhead DecompositionPrabhu Vellaisamy, Shreesh Tripathi, Vignesh Natarajan et al.
Large Language Model (LLM) inference is widely used in interactive assistants and agentic systems. In latency-sensitive deployments, inference time can become dominated by host-side overheads. Existing approaches typically expose this cost only as an aggregate residual or a launch/queue metric, which is often insufficient to identify which execution layer should be optimized. This work presents TaxBreak, a trace-driven methodology for decomposing host-visible orchestration overhead into three components: framework translation time, CUDA library translation time, and kernel launch-path time. We validate TaxBreak on NVIDIA H100 and H200 systems and use it to derive our proposed Host-Device Balance Index (HDBI), a boundedness summary index that relates device-active execution to host-visible orchestration. Across representative dense and mixture-of-experts workloads in both prefill and decode, we show that aggregate latency, GPU inactivity, or boundedness ratios alone can obscure the dominant optimization target. TaxBreak instead distinguishes cases where optimization should reduce software-stack overhead from cases where the primary win comes from reducing device-side work. We further show that MoE models dispatch 8-11x more kernels per output token than dense models, and that for such host-bound workloads, CPU single-thread performance is a first-order parameter: a faster host CPU reduces orchestration overhead by 10-29% and improves end-to-end latency by up to 14%, even when paired with a slower-clocked GPU. These results position TaxBreak as a diagnostic tool for assessing whether optimization effort should target the software stack or the device-side workload execution.
LGApr 7, 2025
BRIDGES: Bridging Graph Modality and Large Language Models within EDA TasksWei Li, Yang Zou, Christopher Ellis et al.
While many EDA tasks already involve graph-based data, existing LLMs in EDA primarily either represent graphs as sequential text, or simply ignore graph-structured data that might be beneficial like dataflow graphs of RTL code. Recent studies have found that LLM performance suffers when graphs are represented as sequential text, and using additional graph information significantly boosts performance. To address these challenges, we introduce BRIDGES, a framework designed to incorporate graph modality into LLMs for EDA tasks. BRIDGES integrates an automated data generation workflow, a solution that combines graph modality with LLM, and a comprehensive evaluation suite. First, we establish an LLM-driven workflow to generate RTL and netlist-level data, converting them into dataflow and netlist graphs with function descriptions. This workflow yields a large-scale dataset comprising over 500,000 graph instances and more than 1.5 billion tokens. Second, we propose a lightweight cross-modal projector that encodes graph representations into text-compatible prompts, enabling LLMs to effectively utilize graph data without architectural modifications. Experimental results demonstrate 2x to 10x improvements across multiple tasks compared to text-only baselines, including accuracy in design retrieval, type prediction and perplexity in function description, with negligible computational overhead (<1% model weights increase and <30% additional runtime overhead). Even without additional LLM finetuning, our results outperform text-only by a large margin. We plan to release BRIDGES, including the dataset, models, and training flow.