LG · Feb 12

Deep Kernel Fusion for Transformers

arXiv:2602.11808v1 · 1 citation · h-index: 13
Originality: Incremental advance
AI Analysis

This work addresses a critical performance issue for users of large language models in memory-limited, long-context inference scenarios, though it is incremental as it builds on existing kernel optimization techniques.

The paper tackles the memory bandwidth bottleneck in agentic LLM inference with long contexts, particularly from SwiGLU MLP blocks, by proposing DeepFusionKernel, which reduces HBM traffic and improves cache reuse to achieve up to 13.2% speedup on H100 and 9.7% on A100 over SGLang.

Agentic LLM inference with long contexts is increasingly limited by memory bandwidth rather than compute. In this setting, SwiGLU MLP blocks, whose large weights exceed cache capacity, become a major yet under-optimized bottleneck. We propose DeepFusionKernel, a deeply fused kernel that cuts HBM traffic and boosts cache reuse, delivering up to 13.2% speedup on H100 and 9.7% on A100 over SGLang. Integrated with SGLang and paired with a kernel scheduler, DeepFusionKernel ensures consistent accelerations over generation lengths, while remaining adaptable to diverse models, inference configurations, and hardware platforms.
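To make the bottleneck concrete, here is a minimal NumPy sketch of a SwiGLU MLP block, computed once layer by layer and once in tiles over the intermediate dimension. It is not the paper's DeepFusionKernel: the function names, the tile size, and the example dimensions are assumptions chosen for illustration. The tiled variant shows only the general idea behind fusion, namely that the full intermediate activation never has to be written out and read back, because each slice of the gate and up projections is consumed by the down projection straight away.

```python
import numpy as np


def silu(z):
    # SiLU (swish) activation: z * sigmoid(z).
    return z / (1.0 + np.exp(-z))


def swiglu_weight_bytes(d_model, d_ff, bytes_per_param=2):
    # Three weight matrices: gate and up (d_model x d_ff), down (d_ff x d_model).
    return 3 * d_model * d_ff * bytes_per_param


def swiglu_mlp_unfused(x, w_gate, w_up, w_down):
    # Reference path: every intermediate of shape (tokens, d_ff) is materialized.
    gate = x @ w_gate
    up = x @ w_up
    hidden = silu(gate) * up
    return hidden @ w_down


def swiglu_mlp_tiled(x, w_gate, w_up, w_down, tile=64):
    # Illustrative "fused" path: walk the intermediate dimension in tiles and
    # accumulate the down projection, so only a (tokens, tile) slice is live.
    d_ff = w_gate.shape[1]
    out = np.zeros((x.shape[0], w_down.shape[1]), dtype=x.dtype)
    for start in range(0, d_ff, tile):
        stop = min(start + tile, d_ff)
        gate = x @ w_gate[:, start:stop]
        up = x @ w_up[:, start:stop]
        out += (silu(gate) * up) @ w_down[start:stop, :]
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model, d_ff, tokens = 32, 160, 4  # toy sizes, assumed for the demo
    x = rng.standard_normal((tokens, d_model))
    w_gate = rng.standard_normal((d_model, d_ff))
    w_up = rng.standard_normal((d_model, d_ff))
    w_down = rng.standard_normal((d_ff, d_model))

    ref = swiglu_mlp_unfused(x, w_gate, w_up, w_down)
    tiled = swiglu_mlp_tiled(x, w_gate, w_up, w_down, tile=64)
    print("paths agree:", np.allclose(ref, tiled))

    # Hypothetical layer shape (d_model=4096, d_ff=14336) in 16-bit weights.
    print("weight bytes:", swiglu_weight_bytes(4096, 14336))
```

With the hypothetical 4096 x 14336 layer in 16-bit precision, the three weight matrices come to roughly 350 MB per layer. That is far more than a GPU's on-chip cache can hold, so the weights are streamed from HBM on every decoding step. A real fused kernel would perform the tiling on-chip in CUDA or Triton, and the abstract does not describe how DeepFusionKernel arranges its tiles.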

