Chia-Heng Tu

DC
h-index: 11
5 papers
Novelty: 38%
AI Score: 40

5 Papers

66.3 AR Apr 16
Exploring LLM-based Verilog Code Generation with Data-Efficient Fine-Tuning and Testbench Automation

Mu-Chi Chen, Po-Hsuan Huang, Yu-Hung Kao et al.

Recent advances in large language models have improved code generation, but their use in hardware description languages is still limited. Moreover, training data and testbenches for these models are often scarce. This paper presents a workflow that uses multi-agent models to generate testbenches for high-quality fine-tuning data. By automating testbench creation, the fine-tuned model for the specification-to-Verilog task achieves performance comparable to state-of-the-art methods on the refined VerilogEval v2 benchmark while using less training data. This study provides a basis for future work on LLM-based HDL generation and automated verification.

67.1 QUANT-PH Apr 14
Large-Scale Quantum Circuit Simulation on HPC Cluster via Cache Blocking, Boosting, and Gate Fusion Optimization

Chuan-Chi Wang, Yan-Jie Wang, Chia-Heng Tu et al.

Quantum circuit simulation is crucial for the development of quantum algorithms, particularly given the high cost and noise limitations of physical quantum hardware. While full-state quantum circuit simulation is commonly employed for prototyping and debugging, it poses challenges because of the exponential increase in simulation time for large quantum systems. In this work, we propose an extensible framework designed to enhance simulation performance by optimizing both data locality and computational efficiency, thereby addressing these challenges. This framework is seamlessly integrated with an optimizer that restructures quantum circuits and a simulator that adjusts execution strategies for various quantum operations. For the newly developed components, merge booster and diagonal detector, the underlying algorithms are inspired by the principles of quantum entanglement and gate fusion, as well as by the limitations identified in existing third-party simulation libraries. The experiments were conducted on eight DGX-H100 workstations, each equipped with eight NVIDIA H100 GPUs, employing both gate-level and circuit-level benchmarks. The results indicate a speedup of up to 160 times for circuit-level benchmarks and an acceleration of up to 34 times for diagonal-heavy gate-level benchmarks compared to existing simulators. The proposed methodologies are anticipated to deliver more robust and faster quantum circuit simulations, thereby fostering the advancement of novel quantum algorithms.
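The gate fusion and diagonal-gate ideas mentioned in this abstract can be illustrated with a minimal sketch. It is not the paper's merge booster or diagonal detector: the function names (`fuse`, `is_diagonal`, `apply_1q`) are hypothetical, and the example is restricted to single-qubit gates on a plain Python state vector. It only shows why fusing consecutive gates on the same qubit saves passes over the state vector, and why a diagonal gate needs fewer multiplications.

```python
import cmath
import math

def matmul2(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def is_diagonal(gate, tol=1e-12):
    """True when both off-diagonal entries are (numerically) zero."""
    return abs(gate[0][1]) <= tol and abs(gate[1][0]) <= tol

def fuse(gates):
    """Fuse gates applied in order g0, g1, ... on one qubit into one matrix.

    Applying g0 first and g1 second corresponds to the product g1 * g0.
    """
    fused = [[1, 0], [0, 1]]
    for g in gates:
        fused = matmul2(g, fused)
    return fused

def apply_1q(state, gate, target):
    """Apply a 2x2 gate to qubit `target` of a full state vector."""
    out = list(state)
    stride = 1 << target
    diagonal = is_diagonal(gate)
    for i in range(len(state)):
        if i & stride:
            continue  # each amplitude pair is handled once, from its lower index
        a0, a1 = state[i], state[i | stride]
        if diagonal:
            # Diagonal gates only rescale amplitudes: two multiplies, no mixing.
            out[i] = gate[0][0] * a0
            out[i | stride] = gate[1][1] * a1
        else:
            out[i] = gate[0][0] * a0 + gate[0][1] * a1
            out[i | stride] = gate[1][0] * a0 + gate[1][1] * a1
    return out

# Usage: Hadamard followed by T on qubit 1 of a 3-qubit register.
s = 1 / math.sqrt(2)
H = [[s, s], [s, -s]]
T = [[1, 0], [0, cmath.exp(1j * math.pi / 4)]]
state = [0j] * 8
state[0] = 1 + 0j
two_passes = apply_1q(apply_1q(state, H, 1), T, 1)
one_pass = apply_1q(state, fuse([H, T]), 1)
```

Here `two_passes` and `one_pass` hold the same amplitudes, but the fused version sweeps the state vector once instead of twice, which is where the saving comes from when the vector is large.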

10.7 DC Mar 27
ParaQAOA: Efficient Parallel Divide-and-Conquer QAOA for Large-Scale Max-Cut Problems Beyond 10,000 Vertices

Po-Hsuan Huang, Xie-Ru Li, Chi Chuang et al.

The Quantum Approximate Optimization Algorithm (QAOA) has emerged as a promising solution for combinatorial optimization problems using a hybrid quantum-classical framework. Among combinatorial optimization problems, the Maximum Cut (Max-Cut) problem is particularly important due to its broad applicability in various domains. While QAOA-based Max-Cut solvers have been developed, they primarily favor solution accuracy over execution efficiency, which significantly limits their practicality for large-scale problems. To address this limitation, we propose ParaQAOA, a parallel divide-and-conquer QAOA framework that leverages parallel computing hardware to efficiently solve large Max-Cut problems. ParaQAOA significantly reduces runtime by partitioning large problems into subproblems and solving them in parallel while preserving solution quality. This design not only scales to graphs with tens of thousands of vertices but also provides tunable control over accuracy-efficiency trade-offs, making ParaQAOA adaptable to diverse performance requirements. Experimental results demonstrate that ParaQAOA achieves up to 1,600x speedup over state-of-the-art methods on Max-Cut problems with 400 vertices while maintaining solution accuracy within 2% of the best-known solutions. Furthermore, ParaQAOA solves a 16,000-vertex instance in 19 minutes, compared to over 13.6 days required by the best-known approach. These findings establish ParaQAOA as a practical and scalable framework for large-scale Max-Cut problems under stringent time constraints.
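The divide-and-conquer pattern described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not ParaQAOA itself: an exhaustive search stands in for the QAOA sub-solver, the contiguous partitioning and the block-flip merge step are hypothetical choices, and the sub-solves run sequentially here although they are independent and could be dispatched in parallel.

```python
from itertools import product

def cut_value(edges, side):
    """Number of edges whose endpoints fall on different sides."""
    return sum(1 for u, v in edges if side[u] != side[v])

def solve_block(vertices, edges):
    """Exhaustive Max-Cut on a small subgraph (stand-in for a QAOA sub-solver)."""
    best_val, best_side = -1, None
    for bits in product((0, 1), repeat=len(vertices)):
        side = dict(zip(vertices, bits))
        val = cut_value(edges, side)
        if val > best_val:
            best_val, best_side = val, side
    return best_side

def divide_and_conquer_maxcut(n, edges, block_size):
    """Partition vertices 0..n-1 into contiguous blocks, solve each, then merge."""
    blocks = [list(range(s, min(s + block_size, n)))
              for s in range(0, n, block_size)]
    partial = []
    for block in blocks:  # independent sub-solves: candidates for parallel execution
        members = set(block)
        inner = [(u, v) for u, v in edges if u in members and v in members]
        partial.append(solve_block(block, inner))
    # Merge: flipping every vertex in a block keeps that block's internal cut,
    # so try all block-level flips and keep the one with the largest total cut.
    best_val, best_side = -1, None
    for flips in product((0, 1), repeat=len(blocks)):
        side = {}
        for flip, sub in zip(flips, partial):
            side.update({v: b ^ flip for v, b in sub.items()})
        val = cut_value(edges, side)
        if val > best_val:
            best_val, best_side = val, side
    return best_val, best_side

# Usage: an 8-vertex ring split into two blocks of four vertices.
ring = [(i, (i + 1) % 8) for i in range(8)]
value, assignment = divide_and_conquer_maxcut(8, ring, block_size=4)
```

The block size plays the role of an accuracy-efficiency knob in this toy version: smaller blocks are cheaper to solve but ignore more cross-block edges until the merge step.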

DC Jun 30, 2025
Towards Building Private LLMs: Exploring Multi-Node Expert Parallelism on Apple Silicon for Mixture-of-Experts Large Language Model

Mu-Chi Chen, Po-Hsuan Huang, Xiangrui Ke et al.

Large Language Models (LLMs) have revolutionized Artificial Intelligence (AI) with significant advancements such as OpenAI's ChatGPT, Meta's Llama, and Databricks' DBRX. This paper addresses the cost and scalability challenges encountered when constructing private LLM systems for personal or small-group services, the setting targeted by Apple Intelligence. A Mac Studio cluster with Apple's M2 Ultra chips is established as a cost-efficient solution to host and accelerate the pretrained DBRX model with the Mixture-of-Experts (MoE) architecture. Our performance analysis reveals that parallel execution of the model's experts across two to four machine nodes significantly reduces inference time. We find that computation time for the experts is comparable to the communication time for exchanging their outputs, emphasizing the importance of network latency over bandwidth. We also observe significant management overhead due to the memory management logic of Apple's software stack. Based on these findings, we develop optimization schemes to eliminate the memory management overhead. As a result, the Mac Studio cluster is 1.15 times more cost-efficient than the state-of-the-art AI supercomputer with NVIDIA H100 GPUs. In addition, we construct a performance model to estimate system performance under varying configurations, and the model provides valuable insights for designing private LLM systems.
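The remark about latency mattering more than bandwidth can be made concrete with the standard latency-plus-bandwidth cost model for a single message. This is a generic sketch, not the paper's performance model, and every number in the usage lines is a hypothetical placeholder rather than a measurement from the study.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Time to send one message: fixed latency plus size divided by bandwidth."""
    return latency_s + message_bytes / bandwidth_bytes_per_s

def latency_share(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Fraction of the transfer time that is spent on fixed latency."""
    total = transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s)
    return latency_s / total

# Hypothetical link: 50 microseconds latency, 1.25 GB/s bandwidth (assumed values).
LATENCY_S = 50e-6
BANDWIDTH = 1.25e9

small = latency_share(16 * 1024, LATENCY_S, BANDWIDTH)          # small per-token message
large = latency_share(64 * 1024 * 1024, LATENCY_S, BANDWIDTH)   # bulk transfer
```

Under these assumed values the small message is dominated by the fixed latency term while the bulk transfer is dominated by the bandwidth term, which is the general reason small, frequent exchanges are sensitive to latency.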

LG Dec 3, 2020
ResPerfNet: Deep Residual Learning for Regressional Performance Modeling of Deep Neural Networks

Chuan-Chi Wang, Ying-Chiao Liao, Chia-Heng Tu et al.

Rapid advances in computing technology facilitate the development of diverse deep learning applications. Unfortunately, the efficiency of parallel computing infrastructures varies widely across neural network models, which hinders exploration of the design space to find high-performance neural network architectures on specific computing platforms for a given application. To address this challenge, we propose a deep learning-based method, ResPerfNet, which trains a residual neural network on representative datasets obtained on the target platform to predict the performance of a deep neural network. Our experimental results show that ResPerfNet can accurately predict the execution time of individual neural network layers and full network models on a variety of platforms. In particular, ResPerfNet achieves a mean absolute percentage error of 8.4% for LeNet, AlexNet and VGG16 on the NVIDIA GTX 1080Ti, which is substantially lower than that of previously published work.
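The error metric quoted in this abstract, mean absolute percentage error (MAPE), can be computed as in the short sketch below. The function name and the sample timings are hypothetical and are not taken from the paper.

```python
def mape(measured, predicted):
    """Mean absolute percentage error, in percent, of predictions against measurements."""
    if not measured or len(measured) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - m) / abs(m) for m, p in zip(measured, predicted))
    return 100.0 * total / len(measured)

# Hypothetical execution times in milliseconds (illustrative values only).
measured_ms = [10.0, 20.0]
predicted_ms = [11.0, 18.0]
error_percent = mape(measured_ms, predicted_ms)
```

Each prediction in the sample is off by 10% of its measured value, so the average comes out at about 10%. Because the error is normalized per sample, fast and slow layers contribute equally to the score.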