CL AI | Jun 12, 2025

Hybrid-NL2SVA: Integrating RAG and Finetuning for LLM-based NL2SVA

Weihua Xiao, Derek Ekberg, Siddharth Garg, Ramesh Karri

arXiv:2506.21569v1 | 6.75 citations | h-index: 11 | MLCAD

Originality: Incremental advance

AI Analysis

This work addresses the labor-intensive, error-prone task hardware engineers face when writing SVAs by hand, but the advance is incremental: it builds on existing LLM methods and adds domain-specific enhancements.

The paper tackles the problem of automating the translation of natural language to SystemVerilog Assertions (NL2SVA) for hardware design verification. It proposes a customized retrieval-augmented generation (RAG) framework and a synthetic fine-tuning dataset, and reports a 58.42% increase in functionality-matched SVAs over GPT-4o-mini and a 59.05% gain over the base Qwen model.

SystemVerilog Assertions (SVAs) are critical for verifying the correctness of hardware designs, but manually writing them from natural language property descriptions, i.e., NL2SVA, remains a labor-intensive and error-prone task. Recent advances in large language models (LLMs) offer opportunities to automate this translation. However, existing models still struggle with understanding domain-specific syntax and semantics. To enhance LLM performance in NL2SVA, we propose a customized retrieval-augmented generation (RAG) framework and a synthetic fine-tuning dataset that together improve LLM performance. To further improve lightweight models on NL2SVA, our fine-tuning dataset provides prompt-guided explanations that teach LLMs the layer-by-layer construction process of concurrent SVAs, enabling supervised fine-tuning that greatly improves syntax and functionality accuracy. To evaluate the performance of LLMs on NL2SVA, we construct the largest evaluation dataset for NL2SVA, comprising 40 Verilog designs and 229 formally verified SVAs with detailed annotations. Experimental results show that our customized RAG framework increases the number of functionality-matched SVAs by 58.42% over GPT-4o-mini, while Qwen2.5-Coder-7B-Instruct fine-tuned on our fine-tuning dataset and integrated with HybridRetrieval achieves a 59.05% increase over the base Qwen model.
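To make the NL2SVA task and the idea of layer-by-layer construction concrete, the following is a minimal Python sketch that assembles a concurrent SVA string one layer at a time. It is an illustration only, not the paper's framework or dataset format: the function name `build_concurrent_sva`, its parameters, and the signal names `clk`, `rst_n`, `req`, and `ack` are all hypothetical.

```python
def build_concurrent_sva(clock, antecedent, consequent,
                         min_delay, max_delay,
                         reset=None, overlapped=True):
    """Assemble a concurrent SVA as a string, layer by layer.

    Hypothetical helper for illustration; it does no syntax checking
    of the signal expressions passed in.
    """
    # Layer 1: delayed consequent sequence, e.g. "##[1:3] ack" or "##2 ack".
    if min_delay == max_delay:
        delay = f"##{min_delay}"
    else:
        delay = f"##[{min_delay}:{max_delay}]"
    sequence = f"{delay} {consequent}"

    # Layer 2: implication. "|->" is overlapped, "|=>" is non-overlapped.
    operator = "|->" if overlapped else "|=>"
    implication = f"{antecedent} {operator} {sequence}"

    # Layer 3: clocking event plus an optional "disable iff" reset guard.
    clocking = f"@(posedge {clock})"
    if reset is not None:
        clocking += f" disable iff ({reset})"

    # Layer 4: wrap everything in the assertion statement.
    return f"assert property ({clocking} {implication});"


# Natural language: "Whenever req is high, ack must be high
# between 1 and 3 cycles later, unless reset is active."
sva = build_concurrent_sva("clk", "req", "ack", 1, 3, reset="!rst_n")
print(sva)
```

For the description in the final comment, the sketch prints `assert property (@(posedge clk) disable iff (!rst_n) req |-> ##[1:3] ack);`. Real NL2SVA inputs are far less regular than this template, which is why the paper turns to LLMs rather than fixed string assembly.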
