LG · AI · May 8, 2025

QiMeng-TensorOp: Automatically Generating High-Performance Tensor Operators with Hardware Primitives

arXiv:2505.06302v1 · 7 citations · h-index: 9
Originality: Highly original
AI Analysis

QiMeng-TensorOp addresses the need for efficient, portable tensor operator generation across hardware platforms such as RISC-V, ARM, and GPUs, reducing manual effort and improving performance for AI applications.

The paper tackles the automatic generation of high-performance tensor operators for diverse hardware architectures, which matters because these operators dominate the computation in LLMs and DNNs. It reports up to a 1291× performance improvement over vanilla LLMs, reaches up to 251% of OpenBLAS performance on RISC-V CPUs and 124% of cuBLAS performance on NVIDIA GPUs, and reduces development costs by 200× compared with human experts.

Computation-intensive tensor operators constitute over 90% of the computations in Large Language Models (LLMs) and Deep Neural Networks. Automatically and efficiently generating high-performance tensor operators with hardware primitives is crucial for diverse and ever-evolving hardware architectures like RISC-V, ARM, and GPUs, as manually optimized implementation takes at least months and lacks portability. LLMs excel at generating high-level language code, but they struggle to fully comprehend hardware characteristics and produce high-performance tensor operators. We introduce a tensor-operator auto-generation framework with a one-line user prompt (QiMeng-TensorOp), which enables LLMs to automatically exploit hardware characteristics to generate tensor operators with hardware primitives, and tune parameters for optimal performance across diverse hardware. Experimental results on various hardware platforms, SOTA LLMs, and typical tensor operators demonstrate that QiMeng-TensorOp effectively unleashes the computing capability of various hardware platforms, and automatically generates tensor operators of superior performance. Compared with vanilla LLMs, QiMeng-TensorOp achieves up to 1291× performance improvement. Even compared with human experts, QiMeng-TensorOp could reach 251% of OpenBLAS on RISC-V CPUs, and 124% of cuBLAS on NVIDIA GPUs. Additionally, QiMeng-TensorOp also significantly reduces development costs by 200× compared with human experts.
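To make the idea of tuning parameters for a given machine concrete, here is a minimal Python sketch of an empirical tile-size search for a blocked matrix multiply. It is an illustration only, not the method used by QiMeng-TensorOp: the function names `matmul_tiled` and `tune_tile`, the candidate tile sizes, and the choice of matrix multiplication as the operator are all assumptions made for this example. In pure Python the interpreter overhead dominates, so the timings show the shape of a tuning loop rather than real cache behaviour.

```python
import time


def matmul_tiled(a, b, n, tile):
    """Multiply two n x n matrices (lists of lists) using square tiles of size `tile`."""
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for k0 in range(0, n, tile):
            for j0 in range(0, n, tile):
                # Work on one tile; min() handles n not divisible by tile.
                for i in range(i0, min(i0 + tile, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(k0, min(k0 + tile, n)):
                        a_ik, row_b = row_a[k], b[k]
                        for j in range(j0, min(j0 + tile, n)):
                            row_c[j] += a_ik * row_b[j]
    return c


def tune_tile(n, candidates, repeats=3):
    """Time each candidate tile size and return (best_tile, timings).

    Hypothetical helper: keeps the best-of-`repeats` wall-clock time per candidate.
    """
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for tile in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            matmul_tiled(a, b, n, tile)
            best = min(best, time.perf_counter() - start)
        timings[tile] = best
    return min(timings, key=timings.get), timings


if __name__ == "__main__":
    best_tile, timings = tune_tile(64, [4, 8, 16, 32])
    print("best tile:", best_tile)
    for tile, seconds in sorted(timings.items()):
        print(f"tile={tile:3d}  {seconds * 1e3:.2f} ms")
```

A real tuner would search over more parameters (register blocking, unroll factors, thread layout) and would measure compiled kernels built from hardware primitives rather than interpreted loops.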

