AR · AI · May 26

AssertLLM2: A Comprehensive LLM Benchmark for Assertion Generation from Design Specifications

arXiv:2605.27472 · 83.9 · h-index: 11 · Has Code
Predicted impact: top 1% in AR · last 90 days
Originality: Incremental advance
AI Analysis

For hardware verification engineers, this benchmark provides a more realistic evaluation of LLM-based assertion generation, addressing limitations of prior benchmarks with structured specifications and buggy RTL inputs.

AssertLLM2 introduces a benchmark for generating SystemVerilog Assertions from design specifications, featuring 83 real-world designs with buggy RTL variants to evaluate both bug-prevention and bug-hunting capabilities. It establishes rigorous baselines for LLMs, showing that current models achieve limited success in realistic settings.

Assertion-based verification (ABV) is a cornerstone of modern hardware design, yet manually translating design intent into formal SystemVerilog Assertions (SVAs) remains labor-intensive and error-prone. While Large Language Models (LLMs) show promise for automating this process, existing benchmarks remain limited by unrealistic task formulations, weak specification inputs, and oversimplified evaluation. To address these limitations, we introduce AssertLLM2, an open-source benchmark for realistic assertion generation in hardware verification. AssertLLM2 contains 83 real-world designs across 13 functional categories. For each design, the benchmark provides a structured design specification, a verified dependency-complete golden RTL, and systematically mutated buggy RTL variants. These support two practical settings: bug-prevention, where assertions are generated from specifications to guard against design errors, and bug-hunting, where assertions are generated to expose discrepancies between intended behavior and faulty implementations. To the best of our knowledge, AssertLLM2 is the first benchmark to explicitly use buggy RTL as input to evaluate bug-detection capability. AssertLLM2 further adopts a more rigorous evaluation framework spanning syntactic validity, formal provability, coverage, and mutation-based bug detection. Our benchmark enables a more realistic and extensive assessment of assertion generation and establishes rigorous baselines for state-of-the-art LLMs in practical hardware verification.
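The evaluation dimensions named in the abstract can be illustrated with a small aggregation sketch. This is not the benchmark's actual scoring code: the `AssertionResult` record, the `summarize` function, the metric names, and the sample data are all hypothetical, and the rule that only assertions proven on the golden RTL count toward bug detection is an assumption. Coverage is left out because the abstract does not say how it is measured.

```python
from dataclasses import dataclass, field


@dataclass
class AssertionResult:
    """Hypothetical per-assertion outcome collected from a formal tool run."""
    name: str
    syntax_ok: bool            # the assertion compiled without errors
    proven_on_golden: bool     # the assertion was proven on the golden RTL
    failing_mutants: set = field(default_factory=set)  # buggy variants it fails on


def summarize(results, all_mutants):
    """Aggregate illustrative metrics for one design's generated assertions."""
    total = len(results)
    valid = [r for r in results if r.syntax_ok]
    proven = [r for r in valid if r.proven_on_golden]

    # Assumption: an assertion that does not hold on the golden RTL is not
    # credited with detecting a bug, since its failure is uninformative.
    killed = set()
    for r in proven:
        killed |= r.failing_mutants & set(all_mutants)

    return {
        "syntax_rate": len(valid) / total if total else 0.0,
        "proof_rate": len(proven) / total if total else 0.0,
        "mutant_detection_rate": len(killed) / len(all_mutants) if all_mutants else 0.0,
    }


# Made-up example data: three generated assertions, four buggy RTL variants.
results = [
    AssertionResult("a_req_ack", True, True, {"m1", "m3"}),
    AssertionResult("a_fifo_full", True, False, {"m2"}),
    AssertionResult("a_bad_syntax", False, False),
]
summary = summarize(results, ["m1", "m2", "m3", "m4"])
print(summary)
```

In this made-up example, `a_fifo_full` fails on mutant `m2` but is not proven on the golden RTL, so `m2` is not counted as detected. A scoring scheme that credits unproven assertions would report a higher detection rate for the same data.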
