ARJan 31, 2024Code
ConSmax: Hardware-Friendly Alternative Softmax with Learnable ParametersShiwei Liu, Guanchen Tao, Yifei Zou et al.
The self-attention mechanism distinguishes transformer-based large language models (LLMs) apart from convolutional and recurrent neural networks. Despite the performance improvement, achieving real-time LLM inference on silicon remains challenging due to the extensive use of Softmax in self-attention. In addition to the non-linearity, the low arithmetic intensity significantly limits processing parallelism, especially when working with longer contexts. To address this challenge, we propose Constant Softmax (ConSmax), a software-hardware co-design that serves as an efficient alternative to Softmax. ConSmax utilizes differentiable normalization parameters to eliminate the need for maximum searching and denominator summation in Softmax. This approach enables extensive parallelization while still executing the essential functions of Softmax. Moreover, a scalable ConSmax hardware design with a bitwidth-split look-up table (LUT) can achieve lossless non-linear operations and support mixed-precision computing. Experimental results show that ConSmax achieves a minuscule power consumption of 0.2mW and an area of 0.0008mm^2 at 1250MHz working frequency in 16nm FinFET technology. For open-source contribution, we further implement our design with the OpenROAD toolchain under SkyWater's 130nm CMOS technology. The corresponding power is 2.69mW and the area is 0.007mm^2. ConSmax achieves 3.35x power savings and 2.75x area savings in 16nm technology, and 3.15x power savings and 4.14x area savings with the open-source EDA toolchain. In the meantime, it also maintains comparable accuracy on the GPT-2 model and the WikiText103 dataset. The project is available at https://github.com/ReaLLMASIC/ConSmax
LGMay 17
LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language ModelsXinting Jiang, Junyi Luo, Ruichen Qi et al.
Sub-billion-parameter Transformer language models are increasingly deployed on edge devices, where the privacy, latency, and operating-cost advantages of on-device inference are constrained by tight memory-bandwidth, energy, and thermal budgets that make architectural choice and accelerator-specific cost central to efficient inference. We present LLMForge, a hardware-aware neural architecture search (NAS) framework whose three composable contributions together make edge-LM architecture search hardware-conditioned, since different substrates impose different hardware cost bottlenecks. Infinite-Head Attention (IHA) decouples the number of query heads, KV groups, and per-head query/key and value dimensions, expanding the feasible per-layer attention configuration space by approximately 400x over grouped-query attention within our search-space ranges. Forge-Former, an encoder-based surrogate for ranking architectural candidates, outperforms MLP and random-forest baselines. Forge-DSE, an NSGA-II-based design-space-exploration engine, pairs Forge-Former with a multi-backend hardware cost model spanning GPUs, systolic accelerators, and ring-dataflow edge accelerators. Across four different hardware substrates, the searches converge to visibly different architectures whose shapes track each substrate's cost bottleneck. On the multi-chip ring substrate, our co-search returns three 300M-scale deployment-aware variants on the Pareto front. Each is re-trained on FineWeb-Edu-10BT under matched recipe against SmolLM2-360M and Qwen-0.5B architecture baselines. The accurate variant has the lowest validation loss 2.798 and competitive benchmark performance with fewer parameters, the energy-optimized variant lowers energy per token by 40%, and the latency-optimized variant lowers TTFT and TPOT by 43%.
QUANT-PHMay 4
Mitigating Classical Resource Costs in Quantum Error Correction via Generalized qLDPC PredecodingAlexander Knapen, Junyi Luo, Guanchen Tao et al.
Quantum-classical interfaces (QCIs) for fault-tolerant quantum computing must manage simultaneous, real-time decoding across thousands to millions of logical qubits. Scaling these architectures necessitates sharing expensive decoding resources among logical qubits, which introduces severe resource contention within the QCI. While resolving these bottlenecks through efficient resource distribution remains a persistent challenge, lightweight predecoding holds promise to alleviate strain on shared decoding components by decreasing average latency and decoder usage. Notably, research into both decoder allocation and predecoding has been strictly confined to the surface code. With the growing emphasis on general quantum low-density parity-check (qLDPC) codes, slower decoding speeds will intensify resource contention, while the inherent complexity of these codes will render manual predecoder design unfeasible. To address this gap, we introduce an automated framework designed to generate predecoders for arbitrary qLDPC codes. These automatically constructed predecoders autonomously process over 90% of the decoding workload, cutting overall decoder utilization by up to 3,963x. This includes a reduction of up to 72.71% in computationally demanding ordered statistics decoding (OSD). Furthermore, we detail a highly efficient, pipelined hardware design that allows for the concurrent decoding of approximately 1,200 bivariate bicycle (BB) code logical qubits using a single FPGA. When implemented as a cryogenic ASIC, the architecture scales to support between 36,000 and 360,000 BB code logical qubits, operating within a 1.5 W power limit at 4 K.