RTLLM: An Open-Source Benchmark for Design RTL Generation with Large Language Model
This work addresses the challenge of fair comparison and comprehensive evaluation of LLM-based solutions in agile hardware design, though it is incremental in that it builds on existing explorations of LLMs for RTL generation.
The authors tackled the lack of a standardized benchmark for evaluating large language models (LLMs) that generate hardware design RTL from natural language. They introduced RTLLM, an open-source benchmark that provides quantitative evaluation against syntax, functionality, and design quality goals, and they developed a self-planning prompt technique that significantly boosts GPT-3.5's performance on this benchmark.
Inspired by the recent success of large language models (LLMs) like ChatGPT, researchers have started to explore the adoption of LLMs for agile hardware design, such as generating design RTL from natural-language instructions. However, in existing works, the target designs are all relatively simple, small in scale, and proposed by the authors themselves, which makes a fair comparison among different LLM solutions challenging. In addition, many prior works focus only on design correctness, without evaluating the design quality of the generated RTL. In this work, we propose an open-source benchmark named RTLLM for generating design RTL from natural-language instructions. To systematically evaluate the auto-generated design RTL, we summarize three progressive goals: the syntax goal, the functionality goal, and the design quality goal. The benchmark can automatically provide a quantitative evaluation of any given LLM-based solution. Furthermore, we propose an easy-to-use yet surprisingly effective prompt engineering technique named self-planning, which significantly boosts the performance of GPT-3.5 on our proposed benchmark.
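As a rough illustration of the three progressive goals, the sketch below gates each goal on the one before it: design quality is measured only if the functionality check passes, which in turn runs only if the syntax check passes. This gating is one reading of "progressive" and is an assumption, as are all names here (`GoalReport`, `evaluate_progressively`, and the three checker callables); none of them come from the RTLLM release. The checkers are left as caller-supplied functions because the abstract does not specify which tools or metrics the benchmark uses.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GoalReport:
    """Outcome of one generated design (hypothetical structure)."""
    syntax_ok: bool
    functionality_ok: bool
    # Quality metrics reported by the caller's function; None if the
    # design never reached the design quality goal.
    quality: Optional[dict]


def evaluate_progressively(
    rtl: str,
    check_syntax: Callable[[str], bool],
    check_functionality: Callable[[str], bool],
    measure_quality: Callable[[str], dict],
) -> GoalReport:
    """Evaluate generated RTL against three goals, stopping at the first failure.

    The checker callables are placeholders for real tooling supplied by the
    caller (an assumption of this sketch, not part of the benchmark itself).
    """
    # Goal 1: the RTL must pass the syntax check.
    if not check_syntax(rtl):
        return GoalReport(False, False, None)
    # Goal 2: the RTL must pass the functional check.
    if not check_functionality(rtl):
        return GoalReport(True, False, None)
    # Goal 3: only functionally correct designs get quality metrics.
    return GoalReport(True, True, measure_quality(rtl))


if __name__ == "__main__":
    # Toy stand-ins for real checkers, for demonstration only.
    report = evaluate_progressively(
        "module m; endmodule",
        check_syntax=lambda rtl: "endmodule" in rtl,
        check_functionality=lambda rtl: True,
        measure_quality=lambda rtl: {"example_metric": 1.0},
    )
    print(report)
```

Stopping at the first failed goal keeps the report unambiguous: a design that fails the syntax check is never credited with functionality or quality results.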