QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation

arXiv:2604.08570
AI Analysis

For researchers evaluating LLMs in quantum computing, this benchmark provides a standardized multi-framework test to separate quantum reasoning from framework familiarity.

QuanBench+ is a unified benchmark for LLM-based quantum code generation across Qiskit, PennyLane, and Cirq. The best one-shot Pass@1 scores are 59.5% (Qiskit), 54.8% (Cirq), and 42.9% (PennyLane); with feedback-based repair, scores rise to 83.3%, 76.2%, and 66.7%, showing progress but also strong framework dependence.
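The summary above does not say how Pass@k is computed. A minimal sketch, assuming the unbiased estimator commonly used in code-generation evaluation (1 - C(n-c, k) / C(n, k) for n samples of which c pass), averaged over tasks. The function names and the sample counts in the usage note are hypothetical and are not taken from the benchmark.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n samples, c of which passed.

    Assumption: the standard unbiased estimator 1 - C(n-c, k) / C(n, k).
    The source does not confirm that QuanBench+ uses this exact formula.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(per_task: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks given (n_samples, n_passed) pairs."""
    if not per_task:
        raise ValueError("per_task must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in per_task) / len(per_task)
```

For example, with hypothetical results of 5 samples per task, `benchmark_pass_at_k([(5, 1), (5, 0)], k=1)` averages a per-task estimate of 0.2 with one of 0.0.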

Large Language Models (LLMs) are increasingly used for code generation, yet quantum code generation is still evaluated mostly within single frameworks, making it difficult to separate quantum reasoning from framework familiarity. We introduce QuanBench+, a unified benchmark spanning Qiskit, PennyLane, and Cirq, with 42 aligned tasks covering quantum algorithms, gate decomposition, and state preparation. We evaluate models with executable functional tests, report Pass@1 and Pass@5, and use KL-divergence-based acceptance for probabilistic outputs. We additionally study Pass@1 after feedback-based repair, where a model may revise code after a runtime error or wrong answer. Across frameworks, the strongest one-shot scores reach 59.5% in Qiskit, 54.8% in Cirq, and 42.9% in PennyLane; with feedback-based repair, the best scores rise to 83.3%, 76.2%, and 66.7%, respectively. These results show clear progress, but also that reliable multi-framework quantum code generation remains unsolved and still depends strongly on framework-specific knowledge.
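The abstract mentions KL-divergence-based acceptance for probabilistic outputs without giving details. A minimal sketch of what such a check could look like, assuming measurement counts are compared against a reference distribution and accepted when KL(expected || observed) falls below a threshold. The direction of the divergence, the threshold value of 0.05, the smoothing constant, and all function names are illustrative assumptions, not the benchmark's actual settings.

```python
import math


def counts_to_probs(counts: dict[str, int]) -> dict[str, float]:
    """Normalize measurement counts (bitstring -> shots) to probabilities."""
    shots = sum(counts.values())
    if shots <= 0:
        raise ValueError("counts must contain at least one shot")
    return {outcome: n / shots for outcome, n in counts.items()}


def kl_divergence(p: dict[str, float], q: dict[str, float],
                  eps: float = 1e-12) -> float:
    """KL(p || q) in nats; q is floored at eps to avoid division by zero.

    The eps floor is an assumed smoothing choice for this sketch.
    """
    total = 0.0
    for outcome, p_x in p.items():
        if p_x > 0.0:
            q_x = max(q.get(outcome, 0.0), eps)
            total += p_x * math.log(p_x / q_x)
    return total


def accept(expected: dict[str, float], observed_counts: dict[str, int],
           threshold: float = 0.05) -> bool:
    """Accept a probabilistic output if it is close to the reference.

    threshold=0.05 is a hypothetical value chosen for illustration.
    """
    observed = counts_to_probs(observed_counts)
    return kl_divergence(expected, observed) <= threshold
```

As a usage example, a Bell-state reference `{"00": 0.5, "11": 0.5}` would accept counts such as `{"00": 510, "11": 490}` under this sketch and reject `{"00": 1000}`, since the missing `"11"` outcome drives the divergence far above the threshold.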

