LG Dec 29, 2025
KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta
Gang Liao, Hongsen Qin, Ying Wang et al.
Making deep learning recommendation model (DLRM) training and inference fast and efficient is important. However, this presents three key system challenges: model architecture diversity, kernel primitive diversity, and hardware generation and architecture heterogeneity. This paper presents KernelEvolve, an agentic kernel coding framework, to tackle heterogeneity at scale for DLRM. KernelEvolve is designed to take kernel specifications as input and automate the process of kernel generation and optimization for recommendation models across heterogeneous hardware architectures. KernelEvolve does so by operating at multiple programming abstractions, from Triton and CuTe DSL to low-level hardware-agnostic languages, spanning the full hardware-software optimization stack. The kernel optimization process is described as a graph-based search with a selection policy, universal operator, fitness function, and termination rule, and it dynamically adapts to the runtime execution context through retrieval-augmented prompt synthesis. We designed, implemented, and deployed KernelEvolve to optimize a wide variety of production recommendation models across generations of NVIDIA and AMD GPUs, as well as Meta's AI accelerators. We validate KernelEvolve on the publicly available KernelBench suite, achieving a 100% pass rate on all 250 problems across three difficulty levels, and on 160 PyTorch ATen operators across three heterogeneous hardware platforms, demonstrating 100% correctness. KernelEvolve reduces development time from weeks to hours and achieves substantial performance improvements over PyTorch baselines across diverse production use cases and for heterogeneous AI systems at scale. Beyond performance efficiency improvements, KernelEvolve significantly mitigates the programmability barrier for new AI hardware by enabling automated kernel generation for in-house developed AI hardware.
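The four search components named in the abstract (selection policy, universal operator, fitness function, termination rule) can be pictured with a small generic loop. The sketch below is illustrative only and is not KernelEvolve's implementation: the names `Node`, `graph_search`, and `demo` are hypothetical, the "kernel" is reduced to a single integer tile size, and the fitness function is a made-up score rather than a real benchmark.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    """One vertex of the search graph: a candidate and its measured fitness."""
    candidate: int                   # toy stand-in for a kernel variant (a tile size)
    fitness: float
    parent: Optional["Node"] = None  # edge back to the node this one was derived from


def graph_search(
    seed: int,
    select: Callable[[List[Node]], Node],       # selection policy
    operator: Callable[[int], int],             # operator that derives a new candidate
    fitness: Callable[[int], float],            # fitness function (higher is better)
    should_stop: Callable[[int, Node], bool],   # termination rule
    max_steps: int = 200,
) -> Tuple[Node, List[Node]]:
    root = Node(seed, fitness(seed))
    graph = [root]
    best = root
    for step in range(max_steps):
        if should_stop(step, best):
            break
        parent = select(graph)                  # pick a node to expand
        new_candidate = operator(parent.candidate)
        child = Node(new_candidate, fitness(new_candidate), parent)
        graph.append(child)                     # grow the graph by one node
        if child.fitness > best.fitness:
            best = child
    return best, graph


def demo(seed_tile: int = 8, target_tile: int = 64, rng_seed: int = 0):
    """Toy run: search for a tile size; the score peaks at target_tile (assumed)."""
    rng = random.Random(rng_seed)

    def fitness(tile: int) -> float:
        return -abs(tile - target_tile)

    def select(graph: List[Node]) -> Node:
        return max(graph, key=lambda n: n.fitness)   # greedy: expand the best node

    def operator(tile: int) -> int:
        # Randomly double or halve the tile size, never going below 1.
        return max(1, tile * 2 if rng.random() < 0.5 else tile // 2)

    def should_stop(step: int, best: Node) -> bool:
        return best.fitness == 0                     # stop once the optimum is found

    return graph_search(seed_tile, select, operator, fitness, should_stop)
```

Calling `demo()` returns the best node and the full graph; following `parent` links from the best node back to the root recovers the chain of transformations that produced it. In a real system the operator would presumably be an LLM-driven code transformation and the fitness a measured runtime, but both are assumptions here.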
DB Apr 13, 2024
Bullion: A Column Store for Machine Learning
Gang Liao, Ye Liu, Jianjun Chen et al.
The past two decades have witnessed significant success in applying columnar storage to data warehousing and analytics. However, the rapid growth of machine learning poses new challenges. This paper presents Bullion, a columnar storage system tailored for machine learning workloads. Bullion addresses the complexities of data compliance, optimizes the encoding of long-sequence sparse features, efficiently manages wide-table projections, introduces feature quantization in storage, enables quality-aware sequential reads for multimodal training data, and provides a comprehensive cascading encoding framework that unifies diverse encoding schemes through modular, composable interfaces. By aligning with the evolving requirements of ML applications, Bullion extends columnar storage and processing to modern application scenarios such as advertising, recommendation systems, and generative AI. Preliminary experimental results and theoretical analysis indicate that Bullion handles the unique demands of machine learning workloads better than existing columnar storage solutions. Bullion significantly reduces I/O costs for deletion compliance, achieves substantial storage savings with its optimized encoding scheme for sparse features, and improves metadata parsing speed for wide-table projections. These advancements position Bullion as an important component of future machine learning infrastructure, enabling organizations to efficiently manage and process the massive volumes of data required for training and inference in modern AI applications.
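To make the idea of composable, cascading encodings concrete, here is a minimal sketch in which a delta stage feeds a run-length stage. It is not Bullion's actual interface: the class names `Delta`, `RunLength`, and `Cascade` and their `encode`/`decode` methods are hypothetical, chosen only to show how independent stages can be chained and reversed.

```python
from typing import Any, List, Tuple


class Delta:
    """Store the first value, then differences between neighbours."""

    def encode(self, values: List[int]) -> List[int]:
        if not values:
            return []
        return [values[0]] + [cur - prev for prev, cur in zip(values, values[1:])]

    def decode(self, data: List[int]) -> List[int]:
        out: List[int] = []
        total = 0
        for i, d in enumerate(data):
            total = d if i == 0 else total + d   # running sum restores the originals
            out.append(total)
        return out


class RunLength:
    """Collapse repeated neighbours into (value, count) pairs."""

    def encode(self, values: List[int]) -> List[Tuple[int, int]]:
        runs: List[Tuple[int, int]] = []
        for v in values:
            if runs and runs[-1][0] == v:
                runs[-1] = (v, runs[-1][1] + 1)
            else:
                runs.append((v, 1))
        return runs

    def decode(self, data: List[Tuple[int, int]]) -> List[int]:
        out: List[int] = []
        for v, n in data:
            out.extend([v] * n)
        return out


class Cascade:
    """Chain stages: encode left to right, decode right to left."""

    def __init__(self, *stages: Any) -> None:
        self.stages = stages

    def encode(self, values: Any) -> Any:
        for stage in self.stages:
            values = stage.encode(values)
        return values

    def decode(self, data: Any) -> Any:
        for stage in reversed(self.stages):
            data = stage.decode(data)
        return data


codec = Cascade(Delta(), RunLength())
ids = [100, 101, 102, 103, 104, 110]
encoded = codec.encode(ids)   # deltas [100, 1, 1, 1, 1, 6] -> [(100, 1), (1, 4), (6, 1)]
```

Because every stage exposes the same two methods, stages can be reordered or swapped without touching the others, which is the property a modular encoding framework is after.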
SE Jul 23, 2012
A New P2N Approach to Software Development Under the Clustering
Gang Liao, Lei Liu, Lian Luo
In this era of rapid computing development, software development is everywhere, yet many software projects die along the way. As The Mythical Man-Month observed, software development has a problem, and that problem is interflow. A lack of interflow can be a great calamity. Clustering is an environment that breeds new life. In this thesis, we elaborate on how P2N can be used for thinking, planning, developing, collaborating, and releasing, and on how the approach can improve your team and organization.
SE May 24, 2012
An Adaptive XP-based approach to Agile Development
Gang Liao, Lei Liu, Lian Luo
Software design is gradually becoming open, distributed, pervasive, and connected. It is a sad statistical fact that software projects are scientifically fragile and tend to fail more often than projects in other engineering fields. Agile development is a philosophy, and agile methods are processes that support that philosophy. XP places a strong emphasis on technical practices in addition to the more common teamwork and structural practices. In this paper, we elaborate on how XP practices can be used for thinking, collaborating, releasing, planning, and developing, and on how they can make your team and organization more successful.