QiMeng-CodeV-SVA: Training Specialized LLMs for Hardware Assertion Generation via RTL-Grounded Bidirectional Data Synthesis
This work targets the weak performance of general-purpose LLMs on hardware assertion generation, a task that matters to verification engineers. The contribution is incremental: it builds on existing LLM methods and adds a new data synthesis approach.
The paper tackles the generation of SystemVerilog Assertions (SVAs) for hardware verification by addressing two challenges: the scarcity of data and the difficulty of determining semantic equivalence. The resulting CodeV-SVA-14B reaches 75.8% on NL2SVA-Human and 84.0% on NL2SVA-Machine in Func.@1.
SystemVerilog Assertions (SVAs) are crucial for hardware verification. Recent studies leverage general-purpose LLMs to translate natural language properties to SVAs (NL2SVA), but they perform poorly due to limited data. We propose a data synthesis framework to tackle two challenges: the scarcity of high-quality real-world SVA corpora and the lack of reliable methods to determine NL-SVA semantic equivalence. For the former, large-scale open-source RTLs are used to guide LLMs to generate real-world SVAs; for the latter, bidirectional translation serves as a data selection method. With the synthesized data, we train CodeV-SVA, a series of SVA generation models. Notably, CodeV-SVA-14B achieves 75.8% on NL2SVA-Human and 84.0% on NL2SVA-Machine in Func.@1, matching or exceeding advanced LLMs like GPT-5 and DeepSeek-R1.
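The abstract does not spell out how bidirectional translation selects data, so the following is only a minimal sketch of one plausible reading: translate each candidate SVA to natural language, translate that description back to an SVA, and keep the pair only if the round trip preserves the assertion. Every name here (`Pair`, `select_pairs`, `normalize`, the toy lookup tables) is hypothetical and not from the paper. The whitespace-insensitive string comparison is a crude stand-in for a real equivalence check, and the dictionary lookups stand in for LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Pair:
    nl: str   # natural-language description of the property
    sva: str  # SystemVerilog assertion text


def normalize(sva: str) -> str:
    """Collapse runs of whitespace (stand-in for a real equivalence check)."""
    return " ".join(sva.split())


def select_pairs(
    svas: Iterable[str],
    sva_to_nl: Callable[[str], str],
    nl_to_sva: Callable[[str], str],
    equivalent: Callable[[str, str], bool],
) -> List[Pair]:
    """Keep (NL, SVA) pairs whose SVA -> NL -> SVA round trip is judged equivalent."""
    kept: List[Pair] = []
    for sva in svas:
        nl = sva_to_nl(sva)          # forward translation
        back = nl_to_sva(nl)         # backward translation
        if equivalent(sva, back):    # round trip preserved the property
            kept.append(Pair(nl=nl, sva=sva))
    return kept


# Toy translators: fixed lookup tables in place of model calls (assumption).
GOOD = "assert property (@(posedge clk) req |-> ##1 ack);"
BAD = "assert property (@(posedge clk) !(a && b));"

FORWARD = {
    GOOD: "Whenever req is high, ack is high one cycle later.",
    BAD: "a and b are related.",  # vague description loses the meaning
}
BACKWARD = {
    FORWARD[GOOD]: "assert property (@(posedge clk)  req |-> ##1 ack);",
    FORWARD[BAD]: "assert property (@(posedge clk) a |-> b);",
}

kept = select_pairs(
    [GOOD, BAD],
    sva_to_nl=FORWARD.__getitem__,
    nl_to_sva=BACKWARD.__getitem__,
    equivalent=lambda a, b: normalize(a) == normalize(b),
)
```

In this toy run only the `req`/`ack` pair survives, because the vague description of the second assertion translates back to a different property. A real pipeline would presumably replace `equivalent` with a formal or simulation-based check rather than string comparison.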