QUANT-PH Jun 9, 2023
Improving Quantum Circuit Synthesis with Machine Learning
Mathias Weiden, Ed Younis, Justin Kalloor et al.
In the Noisy Intermediate-Scale Quantum (NISQ) era, finding implementations of quantum algorithms that minimize the number of expensive and error-prone multi-qubit gates is vital to ensure computations produce meaningful outputs. Unitary synthesis, the process of finding a quantum circuit that implements some target unitary matrix, is able to solve this problem optimally in many cases. However, current bottom-up unitary synthesis algorithms are limited by their exponentially growing run times. We show how applying machine learning to unitary datasets permits drastic speedups for synthesis algorithms. This paper presents QSeed, a seeded synthesis algorithm that employs a learned model to quickly propose resource-efficient circuit implementations of unitaries. QSeed maintains low gate counts and offers a speedup of $3.7\times$ in synthesis time over the state of the art for a 64-qubit modular exponentiation circuit, a core component in Shor's factoring algorithm. QSeed's performance improvements also generalize to families of circuits not seen during the training process.
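As a rough illustration of what "implements some target unitary" means in practice, the sketch below scores a candidate circuit's unitary against a target. The cost used here, $1 - |\mathrm{Tr}(U^\dagger V)|/d$, is one common choice and is an assumption of this example, not necessarily the cost QSeed uses. The helper names (`matmul`, `dagger`, `hs_distance`) are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def dagger(a):
    """Conjugate transpose of a square matrix."""
    n = len(a)
    return [[a[j][i].conjugate() for j in range(n)] for i in range(n)]

def hs_distance(target, candidate):
    """Assumed cost: 1 - |Tr(target^dagger candidate)| / d.

    It is 0 when the two unitaries agree up to a global phase.
    """
    d = len(target)
    prod = matmul(dagger(target), candidate)
    trace = sum(prod[i][i] for i in range(d))
    return 1.0 - abs(trace) / d

# Single-qubit gates used for the demonstration.
s = 1.0 / math.sqrt(2.0)
H = [[s, s], [s, -s]]
X = [[0.0, 1.0], [1.0, 0.0]]
Z = [[1.0, 0.0], [0.0, -1.0]]

# The circuit H-X-H implements Z, so its distance to Z is (numerically) zero.
hxh = matmul(matmul(H, X), H)
print(hs_distance(Z, hxh))
# X alone does not implement Z; Tr(Z^dagger X) = 0, so the distance is 1.
print(hs_distance(Z, X))
```

A synthesis routine would search over circuit structures and gate parameters until such a distance falls below a chosen threshold; the search itself is not shown here.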
50.4 AR May 31
Linear Complexity Fermionic Simulation on Quantum Devices with Hardware Connectivity Constraints
Xiangyu Gao, Winston Li, Jiakang Li et al.
Simulating fermionic systems on quantum hardware requires compiling fermionic Hamiltonians into executable quantum circuits. Existing approaches treat each compilation stage independently, applying heuristics with localized objectives that produce circuits with superquartic gate count and depth scaling and compilation times reaching several hours for large instances. We present Accordion, an end-to-end framework that co-designs the fermion-to-qubit mapping with circuit synthesis and hardware routing. Accordion fixes the Jordan-Wigner mapping, which, despite its higher Pauli weight, produces Pauli operators with structural regularity that enables provably efficient circuit generation. For full-rank all-to-all electronic structure Hamiltonians, we prove $O(N^4)$ gate count and circuit depth, matching the information-theoretic lower bound imposed by the $\Theta(N^4)$ second excitation terms. On linear, IBM heavy-hex, and square-grid architectures, Accordion reduces gate count by up to 79% and circuit depth by up to 77% relative to the best baseline.
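The two quantities mentioned in this abstract, the Pauli weight under the Jordan-Wigner mapping and the quartic number of excitation terms, can be illustrated with a minimal sketch. It assumes the textbook Jordan-Wigner form, in which the operator on mode $j$ carries a chain of $Z$ operators on modes $0,\dots,j-1$, and a left-to-right qubit ordering. The function names are hypothetical and this is not Accordion's implementation.

```python
from math import comb

def jordan_wigner_strings(j, n_modes):
    """Pauli strings (X-type, Y-type) appearing in the ladder operator on mode j.

    Assumed convention: a Z chain on modes 0..j-1, then X or Y on mode j,
    then identities on the remaining modes.
    """
    prefix = "Z" * j
    suffix = "I" * (n_modes - j - 1)
    return prefix + "X" + suffix, prefix + "Y" + suffix

def pauli_weight(pauli_string):
    """Number of non-identity factors in a Pauli string."""
    return sum(c != "I" for c in pauli_string)

def two_body_term_count(n_modes):
    """Number of index tuples (p < q, r < s) for terms a_p^† a_q^† a_r a_s."""
    return comb(n_modes, 2) ** 2

x_str, y_str = jordan_wigner_strings(2, 4)
print(x_str, y_str, pauli_weight(x_str))

# The term count grows as N^4 / 4 to leading order.
for n in (4, 8, 16):
    print(n, two_body_term_count(n))
```

The weight of each string grows linearly with the mode index, which is the "higher Pauli weight" cost of this mapping, while the regular $Z$-chain structure is what makes the strings predictable.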
22.4 QUANT-PH Apr 3
Characterizing and Benchmarking Dynamic Quantum Circuits
Sumeet Shirgure, Efekan Kökcü, Anupam Mitra et al.
Dynamic quantum circuits with mid-circuit measurements (MCMs) and feed-forward operations play a crucial role in various applications, such as quantum error correction and quantum algorithms. With advancements in quantum hardware enabling the implementation of MCM and feed-forward loops, the use of dynamic circuits has become increasingly prevalent. There is a significant need for a benchmarking framework specially designed for dynamic circuits to capture their unique properties, as current benchmarking tools are designed primarily for unitary circuits and cannot be trivially extended to dynamic circuits. We propose dynamarq, a scalable and hardware-agnostic benchmarking framework for dynamic circuits. We collect a set of dynamic circuit benchmarks spanning various applications and propose a broad set of circuit features to characterize the structure of these dynamic circuits. We run them on two IBM quantum processors and the Quantinuum Helios-1E emulator, and propose scalable, application-dependent fidelity scores for each benchmark based on hardware execution results. We perform statistical modeling to identify correlations between circuit features and fidelity scores, and demonstrate highly accurate fidelity prediction using our model. Our model parameters are also transferable across hardware backends and calibration cycles. Our framework facilitates the understanding of dynamic circuit structures and provides insights for designing and optimizing dynamic circuits to achieve high execution fidelity on quantum hardware.
SE May 13, 2025
Leveraging AI for Productive and Trustworthy HPC Software: Challenges and Research Directions
Keita Teranishi, Harshitha Menon, William F. Godoy et al.
We discuss the challenges and propose research directions for using AI to revolutionize the development of high-performance computing (HPC) software. AI technologies, in particular large language models, have transformed every aspect of software development. For its part, HPC software is recognized as a highly specialized scientific field of its own. We discuss the challenges associated with leveraging state-of-the-art AI technologies to develop such a unique and niche class of software and outline our research directions in the two US Department of Energy-funded projects for advancing HPC Software via AI: Ellora and Durban.
MA Nov 21, 2025
Optimizing PyTorch Inference with LLM-Based Multi-Agent Systems
Kirill Nagaitsev, Luka Grbcic, Samuel Williams et al.
Maximizing performance on available GPU hardware is an ongoing challenge for modern AI inference systems. Traditional approaches include writing custom GPU kernels and using specialized model compilers to tune high-level code for specific GPU targets. Recent work shows that LLM-based multi-agent systems can effectively perform such tuning, often outperforming existing compilers and eliminating the need for manual kernel development. However, the dynamics of multi-agent systems for this task remain unexplored. In this work, we present a logical framework for comparing multi-agent PyTorch optimization systems. Our evaluation shows that exploit-heavy strategies perform best when paired with error-fixing agents, and that performance correlates with the granularity of optimization steps. The best implementation achieves an average 2.88x speedup on an H100 GPU across diverse tasks in KernelBench, a benchmark suite covering a range of machine learning architectures in PyTorch.
CL May 29, 2023
SlimFit: Memory-Efficient Fine-Tuning of Transformer-based Models Using Training Dynamics
Arash Ardakani, Altan Haan, Shangyin Tan et al.
Transformer-based models, such as BERT and ViT, have achieved state-of-the-art results across different natural language processing (NLP) and computer vision (CV) tasks. However, these models are extremely memory intensive during their fine-tuning process, making them difficult to deploy on GPUs with limited memory resources. To address this issue, we introduce a new tool called SlimFit that reduces the memory requirements of these models by analyzing their training dynamics at run time and freezing less-contributory layers during fine-tuning. The layers to freeze are chosen using a runtime inter-layer scheduling algorithm. SlimFit adopts quantization and pruning for particular layers to balance the load of dynamic activations and to minimize the memory footprint of static activations, where static activations refer to those that cannot be discarded regardless of freezing. This allows SlimFit to freeze up to 95% of layers and reduce the overall on-device GPU memory usage of transformer-based models such as ViT and BERT by an average of 2.2x, across different NLP and CV benchmarks/datasets such as GLUE, SQuAD 2.0, CIFAR-10, CIFAR-100 and ImageNet with an average degradation of 0.2% in accuracy. For such NLP and CV tasks, SlimFit can reduce the total on-device memory usage by up to 3.1x with an accuracy degradation of only up to 0.4%. As a result, while fine-tuning of ViT on ImageNet and BERT on SQuAD 2.0 with a batch size of 128 requires three and two 32GB GPUs, respectively, SlimFit enables their fine-tuning on a single 32GB GPU without any significant accuracy degradation.
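The idea of freezing less-contributory layers can be sketched as follows. This is an illustration under stated assumptions, not SlimFit's scheduling algorithm: it assumes a layer's contribution is measured by the relative L2 change of its weights between two checkpoints, and that a fixed fraction of the lowest-scoring layers is frozen. All names (`relative_update`, `choose_frozen_layers`, the layer names) are hypothetical.

```python
import math

def relative_update(old, new):
    """Assumed contribution score: ||new - old||_2 / ||old||_2."""
    diff = math.sqrt(sum((n - o) ** 2 for o, n in zip(old, new)))
    base = math.sqrt(sum(o ** 2 for o in old))
    if base == 0.0:
        # A layer with all-zero weights is treated as always contributing.
        return float("inf")
    return diff / base

def choose_frozen_layers(prev_weights, curr_weights, freeze_fraction):
    """Return the names of the layers whose weights changed the least."""
    scores = {name: relative_update(prev_weights[name], curr_weights[name])
              for name in prev_weights}
    ranked = sorted(scores, key=scores.get)  # smallest change first
    k = int(len(ranked) * freeze_fraction)
    return set(ranked[:k])

# Toy flattened weights for four layers at two consecutive checkpoints.
prev = {"embed": [1.0, 1.0], "block1": [2.0, 0.0],
        "block2": [0.0, 3.0], "head": [1.0, -1.0]}
curr = {"embed": [1.0, 1.001], "block1": [2.1, 0.0],
        "block2": [0.0, 3.003], "head": [1.5, -1.0]}

print(sorted(choose_frozen_layers(prev, curr, freeze_fraction=0.5)))
```

In a real training loop the frozen layers would skip gradient computation, so their dynamic activations need not be kept for the backward pass; that is where the memory saving comes from.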