AR, CR, LG | Mar 17, 2025

VeriContaminated: Assessing LLM-Driven Verilog Coding for Data Contamination

arXiv:2503.13572v3 | 14 citations | h-index: 22 | Has Code | 2025 IEEE International Conference on LLM-Aided Design (ICLAD)
Originality: Synthesis-oriented
AI Analysis

This work addresses contamination risks in hardware coding evaluations, an overlooked area, but it is incremental: it applies known detection methods to a new domain.

The study analyzed data contamination in LLM-driven Verilog code generation, confirming it as a critical issue and exploring trade-offs between code quality and fairness in benchmarking.

Large Language Models (LLMs) have revolutionized code generation, achieving exceptional results on various established benchmarking frameworks. However, concerns about data contamination, where benchmark data inadvertently leaks into pre-training or fine-tuning datasets, raise questions about the validity of these evaluations. This issue is known and limits the industrial adoption of LLM-driven software engineering, yet hardware coding has received little to no attention regarding these risks. For the first time, we analyze state-of-the-art (SOTA) evaluation frameworks for Verilog code generation (VerilogEval and RTLLM), using established methods for contamination detection (CCD and Min-K% Prob). We cover SOTA commercial and open-source LLMs (CodeGen2.5, Minitron 4b, Mistral 7b, phi-4 mini, LLaMA-{1,2,3.1}, GPT-{2,3.5,4o}, Deepseek-Coder, and CodeQwen 1.5), both baseline models and fine-tuned ones (RTLCoder and Verigen). Our study confirms that data contamination is a critical concern. We explore mitigations and the resulting trade-offs between code quality and fairness (i.e., reducing contamination toward unbiased benchmarking).
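For readers unfamiliar with Min-K% Prob, the sketch below illustrates the general idea behind that detector: score a benchmark sample by the average log-probability of its least likely tokens, on the intuition that text seen during training tends to contain few very surprising tokens. It is a minimal illustration and not the paper's pipeline. The function names, the default `k`, and the decision threshold are hypothetical, and the per-token log-probabilities are assumed to have been obtained from the model under test beforehand.

```python
import math
from typing import Sequence


def min_k_percent_score(token_logprobs: Sequence[float], k: float = 0.2) -> float:
    """Average log-probability of the k fraction of least likely tokens.

    token_logprobs: per-token log-probabilities of one benchmark sample,
    assumed to come from the model being audited (hypothetical input).
    k: fraction of tokens to keep, in (0, 1]; 0.2 is an illustrative default.
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    if not 0.0 < k <= 1.0:
        raise ValueError("k must be in (0, 1]")
    # Keep at least one token so short samples still get a score.
    n = max(1, math.ceil(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


def flag_contaminated(token_logprobs: Sequence[float],
                      k: float = 0.2,
                      threshold: float = -2.0) -> bool:
    """Flag a sample when its Min-K% score exceeds a threshold.

    The threshold is an assumption for illustration; in practice it would
    be calibrated on samples with known membership status.
    """
    return min_k_percent_score(token_logprobs, k) > threshold


if __name__ == "__main__":
    # Made-up log-probabilities, for demonstration only.
    likely_seen = [-0.1, -0.2, -0.1, -0.2]
    likely_unseen = [-0.1, -5.0, -4.0, -0.2]
    print(flag_contaminated(likely_seen, k=0.5))
    print(flag_contaminated(likely_unseen, k=0.5))
```

A higher (less negative) score means even the hardest tokens were predicted confidently, which is treated as weak evidence of contamination. The score alone does not prove that a benchmark item was in the training data.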

