A New Benchmark for the Appropriate Evaluation of RTL Code Optimization
The work addresses the need for better evaluation of generative models in hardware design optimization, though the contribution is incremental: it builds on existing benchmarks by adding optimization-focused metrics.
The paper tackles the problem of evaluating LLMs for RTL code optimization by introducing RTL-OPT, a benchmark of 36 handcrafted designs that assesses optimization quality in terms of power, performance, and area (PPA) and provides an automated evaluation framework for standardized assessment.
The rapid progress of artificial intelligence increasingly relies on efficient integrated circuit (IC) design. Recent studies have explored the use of large language models (LLMs) for generating Register Transfer Level (RTL) code, but existing benchmarks mainly evaluate syntactic correctness rather than optimization quality in terms of power, performance, and area (PPA). This work introduces RTL-OPT, a benchmark for assessing the capability of LLMs in RTL optimization. RTL-OPT contains 36 handcrafted digital designs that cover diverse implementation categories, including combinational logic, pipelined datapaths, finite state machines, and memory interfaces. Each task provides a pair of RTL implementations: a suboptimal version and a human-optimized reference that reflects industry-proven optimization patterns not captured by conventional synthesis tools. Furthermore, RTL-OPT integrates an automated evaluation framework that verifies functional correctness and quantifies PPA improvements, enabling standardized and meaningful assessment of generative models for hardware design optimization.
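To make the evaluation idea concrete, the following is a minimal sketch of how PPA improvement might be scored for one task. It is not the RTL-OPT framework itself. The `PPA` record, its field names and units, the `score` function, and the rule that a functionally incorrect candidate receives no score are all assumptions introduced for illustration. The metric values would come from a synthesis flow, which is not shown here.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class PPA:
    """Post-synthesis metrics for one design (hypothetical fields and units)."""
    power_mw: float
    delay_ns: float
    area_um2: float


def relative_improvement(baseline: float, candidate: float) -> float:
    """Fractional reduction relative to the baseline; positive means better."""
    if baseline <= 0:
        raise ValueError("baseline metric must be positive")
    return (baseline - candidate) / baseline


def score(baseline: PPA, candidate: PPA,
          functionally_equivalent: bool) -> Optional[Dict[str, float]]:
    """Return per-metric improvements, or None if the candidate is incorrect.

    Assumption: a design that fails functional verification is not scored,
    since a PPA gain on a broken circuit is meaningless.
    """
    if not functionally_equivalent:
        return None
    return {
        "power": relative_improvement(baseline.power_mw, candidate.power_mw),
        "delay": relative_improvement(baseline.delay_ns, candidate.delay_ns),
        "area": relative_improvement(baseline.area_um2, candidate.area_um2),
    }
```

In this sketch the suboptimal RTL plays the role of `baseline`, and either the human-optimized reference or an LLM-generated rewrite plays the role of `candidate`. Reporting the three metrics separately, rather than as one combined number, keeps trade-offs visible, for example an area reduction that comes at the cost of longer delay.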