ARApr 30Code
RuC: HDL-Agnostic Rule Completion Benchmark GenerationArnau Ayguadé Domingo, Miquel Alberti-Binimelis, Cristian Gutierrez-Gomez et al.
Large Language Models (LLMs) have rapidly improved in performance across code-related tasks, making their integration into Register Transfer Level (RTL) development increasingly attractive. Mimicking the behavior of inline code assistants, many benchmarks evaluate LLMs' capabilities in code completion, either assessing the generation of entire hardware modules or the completion of a single line within a module. However both of these approaches lack the ability to control the granularity of the code-completion sample size and the syntactic range of completions. To overcome these limitations, we present a framework for language-agnostic rule completion (RuC), a grammar-driven, rule-selectable benchmark generator that automatically produces RTL code-completion tasks from a set of input hardware description sources. RuC uses the target Hardware Description Language (HDL) grammar to mask syntactically defined code regions and prompts a model to regenerate them using the surrounding unmasked code as context, enabling a controlled and scalable evaluation of the domain-specific model's code-understanding capabilities, ranging from assignments to the reconstruction of entire logic blocks. We use RuC to generate two SystemVerilog rule-completion benchmarks from the Tiny Tapeout shuttle TT07 and the CVE2 RISC-V core to demonstrate RuC's applicability to a broad range of designs, and conduct a comparative study of the code completion capabilities of modern open-source LLMs across diverse settings. Results indicate that completion performance strongly depends on the model type, the grammatical structure of the masked region, and the prompting strategy. Specifically, the highest scores are obtained with Fill-in-the-Middle (FIM) prompting. These findings highlight the value of grammar-driven, arbitrarily granular benchmarks for meaningful evaluation of LLM capabilities in RTL development workflows.
ARApr 29Code
EMiX: Emulating Beyond Single-FPGA LimitsAlexander Kropotov, Miquel Moreto, Behzad Salami
FPGA-level emulation is a key step in pre-silicon chip design validation. However, emulating large-scale multi-core systems increasingly exceed the hardware resource capacity of a single FPGA, limiting the feasibility of full-system emulation. To address this challenge, we introduce EMiX, a scalable multi-FPGA framework that enables distributed emulation of multi-core RISC-V architectures beyond single-FPGA resource limits. EMiX systematically partitions a monolithic multi-core design into multiple components and deploys them across multiple interconnected FPGAs, effectively exploiting inter-FPGA interconnects to balance scalability and performance without requiring fundamental RTL redesign. We prototype EMiX with a 64-core architecture across eight interconnected Alveo U55c FPGAs (scalable on core and FPGA counts), successfully demonstrating full-system execution including Linux boot. EMiX will be released as an open-source platform.
ARDec 23, 2025
NotSoTiny: A Large, Living Benchmark for RTL Code GenerationRazine Moundir Ghorab, Emanuele Parisi, Cristian Gutierrez et al.
LLMs have shown early promise in generating RTL code, yet evaluating their capabilities in realistic setups remains a challenge. So far, RTL benchmarks have been limited in scale, skewed toward trivial designs, offering minimal verification rigor, and remaining vulnerable to data contamination. To overcome these limitations and to push the field forward, this paper introduces NotSoTiny, a benchmark that assesses LLM on the generation of structurally rich and context-aware RTL. Built from hundreds of actual hardware designs produced by the Tiny Tapeout community, our automated pipeline removes duplicates, verifies correctness and periodically incorporates new designs to mitigate contamination, matching Tiny Tapeout release schedule. Evaluation results show that NotSoTiny tasks are more challenging than prior benchmarks, emphasizing its effectiveness in overcoming current limitations of LLMs applied to hardware design, and in guiding the improvement of such promising technology.
ARMay 4
Performance and Energy Benefits of MRDIMMsPau Díaz, Mariana Carmin, Pouya Esmaili-Dokht et al.
Multiplexed Rank DIMMs (MRDIMMs) have recently emerged as memory devices that enable higher bandwidth without increasing DRAM chip frequencies. This paper presents a detailed performance, power and energy evaluation of a production server with high-end MRDIMM main memory. The memory system upgrade from conventional registered DIMMs (RDIMMs) to MRDIMMs extends the bandwidth by 41% yielding 27-41% higher performance for bandwidth-bound workloads. Additionally, the latency improvement reaches hundreds of nanoseconds, benefiting a broad class of workloads sensitive to memory latency. At the same bandwidth utilization levels, RDIMMs and MRDIMMs exhibit similar power consumption. In the MRDIMM-extended bandwidth region, the performance improvements largely exceed the power increase, delivering up to 30% server energy savings for memory-bound workloads.
ARApr 29
Verification and Validation (V&V)-in-the-Loop for RISC-V Design: The Holistic Vision of BZLSajjad Ahmed, Alexander Kropotov, Roberto Ignacio Genovese et al.
The Barcelona Zetascale Lab (BZL) project aims to strengthening Europe's capacity in the design and manufacture of RISC-V based high-performance computing chips. In this context, we present a holistic pre-silicon verification and validation (V&V) methodology targeting highly robust RISC-V chip designs. This paper provides an overview of BZL's V&V approach, which integrates three complementary platforms: (1) a UVM-based verification environment to thoroughly validate RTL functionality; (2) an FPGA-based validation platform that enables system-level pre-silicon hardware-software RTL validation; and (3) a CI/CD flow that continuously automates build, deployment, and tests across these domains. By embedding these platforms into an industrial-grade V&V loop and exploiting large-scale CPU and FPGA hardware infrastructures, the BZL project enables continuous evolution of reliable hardware development and software integration. We believe that the BZL's V&V flow represents a robust and scalable foundation for ensuring the pre-silicon functional correctness and system level validation of RISC-V chip designs, and can serve as a key enabler for strategic initiatives in Europe, such as EPI and DARE, and beyond.
ARMar 31, 2025
TuRTLe: A Unified Evaluation of LLMs for RTL GenerationDario Garcia-Gasulla, Gokcen Kestor, Emanuele Parisi et al.
The rapid advancements in LLMs have driven the adoption of generative AI in various domains, including Electronic Design Automation (EDA). Unlike traditional software development, EDA presents unique challenges, as generated RTL code must not only be syntactically correct and functionally accurate but also synthesizable by hardware generators while meeting performance, power, and area constraints. These additional requirements introduce complexities that existing code-generation benchmarks often fail to capture, limiting their effectiveness in evaluating LLMs for RTL generation. To address this gap, we propose TuRTLe, a unified evaluation framework designed to systematically assess LLMs across key RTL generation tasks. TuRTLe integrates multiple existing benchmarks and automates the evaluation process, enabling a comprehensive assessment of LLM performance in syntax correctness, functional correctness, synthesis, PPA optimization, and exact line completion. Using this framework, we benchmark a diverse set of open LLMs and analyze their strengths and weaknesses in EDA-specific tasks. Our results show that reasoning-based models, such as DeepSeek R1, consistently outperform others across multiple evaluation criteria, but at the cost of increased computational overhead and inference latency. Additionally, base models are better suited in module completion tasks, while instruct-tuned models perform better in specification-to-RTL tasks.