AR LG Apr 8

From LLM to Silicon: RL-Driven ASIC Architecture Exploration for On-Device AI Inference

arXiv:2604.07526 59.6 h-index: 1

AI Analysis

This paper addresses the challenge of efficient hardware design for AI inference by automating optimization across process nodes, but the contribution is incremental because it builds on existing RL methods.

The paper tackles the problem of optimizing ASIC architecture for on-device AI inference by developing an RL-driven compiler that jointly explores design parameters across process nodes, achieving 29,809 tokens per second for Llama 3.1 8B at 3nm and under 13 mW for SmolVLM.

We present an RL-driven compiler that jointly optimizes ASIC architecture, memory hierarchy, and workload partitioning for AI inference across process nodes from 3nm to 28nm. The design space is formulated as a single Markov Decision Process with mixed discrete-continuous actions and a unified Power-Performance-Area (PPA) objective. Soft Actor-Critic (SAC) with Mixture-of-Experts gating explores the joint space of mesh topology, per-core microarchitecture, and operator placement. We validate on two workloads: Llama 3.1 8B FP16 (high-performance mode, 29,809 tokens per second at 3nm) and SmolVLM (low-power mode, less than 13 mW at all nodes at 10 MHz). Across 7 process nodes, the RL agent automatically adapts mesh sizes and per-tile configurations, including heterogeneous FETCH, VLEN, and memory allocation, without node-specific manual retuning.
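
To make the "mixed discrete-continuous actions" and "unified PPA objective" concrete, here is a minimal Python sketch of one way such a formulation could look. It is not the paper's implementation: the abstract does not give the action encoding, the parameter ranges, or the reward formula. The names `DesignPoint`, `decode_action`, `ppa_reward`, the choice sets `MESH_SIDES` and `VLEN_CHOICES`, the SRAM range, and the log-weighted reward are all assumptions made for illustration.

```python
from dataclasses import dataclass
import math

# Hypothetical choice sets; the abstract does not list the actual ranges.
MESH_SIDES = (1, 2, 4, 8, 16)
VLEN_CHOICES = (128, 256, 512, 1024)


@dataclass(frozen=True)
class DesignPoint:
    mesh_rows: int    # discrete
    mesh_cols: int    # discrete
    vlen: int         # discrete vector length
    sram_kib: float   # continuous per-tile memory budget (assumed unit)


def _clip(u):
    """Clamp a raw policy output to [-1, 1]."""
    return max(-1.0, min(1.0, u))


def _pick(choices, u):
    """Map u in [-1, 1] to one element of a discrete choice tuple."""
    idx = int((_clip(u) + 1.0) / 2.0 * len(choices))
    return choices[min(idx, len(choices) - 1)]


def decode_action(action, sram_range=(64.0, 4096.0)):
    """Decode a 4-element action in [-1, 1]^4 into a design point.

    A continuous-action policy such as SAC emits real-valued vectors, so
    the discrete parameters are obtained here by binning (an assumption).
    """
    rows_u, cols_u, vlen_u, sram_u = action
    lo, hi = sram_range
    sram = lo + (_clip(sram_u) + 1.0) / 2.0 * (hi - lo)
    return DesignPoint(
        mesh_rows=_pick(MESH_SIDES, rows_u),
        mesh_cols=_pick(MESH_SIDES, cols_u),
        vlen=_pick(VLEN_CHOICES, vlen_u),
        sram_kib=sram,
    )


def ppa_reward(tokens_per_s, power_mw, area_mm2, weights=(1.0, 1.0, 1.0)):
    """Scalar PPA score: reward throughput, penalize power and area.

    Log scaling keeps quantities that span orders of magnitude comparable;
    the weights are hypothetical knobs, e.g. a larger power weight for a
    low-power mode.
    """
    w_perf, w_power, w_area = weights
    return (w_perf * math.log(tokens_per_s)
            - w_power * math.log(power_mw)
            - w_area * math.log(area_mm2))
```

In a full loop, a simulator or analytical cost model (not shown, and not described in the abstract) would evaluate each decoded `DesignPoint` to produce the throughput, power, and area numbers fed to `ppa_reward`.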
