DC · Apr 26

ClusterFusion++: Expanding Cluster-Level Fusion to Full Transformer-Block Decoding

arXiv:2604.23553 · 15.8

Predicted impact: top 74% in DC · last 90 days

Originality: Incremental advance

AI Analysis

For LLM inference practitioners, this work speeds up decoding by fusing more of the decoder block's operators, but the gain is incremental (1.34x throughput on Pythia-2.8B) and is reported only for GPT-NeoX/Pythia models on an RTX 5090-class GPU.

ClusterFusion++ extends cluster-level fusion to cover the entire Transformer decoder block for GPT-NeoX/Pythia models, achieving 1.34x throughput improvement on Pythia-2.8B and similar gains on Pythia-6.9B on an RTX 5090-class GPU while preserving output fidelity.
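For orientation, the operator chain that the abstract lists for a GPT-NeoX-style decoder block can be written as an unfused reference for a single decoded token. This is only a sketch under stated assumptions, not the paper's CUDA implementation: it uses one attention head, omits biases and learned LayerNorm parameters, applies RoPE to the whole head rather than a fraction of it, and assumes the parallel-residual layout. The function and weight names (`decode_block`, `wq`, `w1`, and so on) are hypothetical.

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize one token's hidden vector (learned scale and bias omitted).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def matvec(w, x):
    # w is a list of rows; returns the matrix-vector product w @ x.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def rope(v, pos, base=10000.0):
    # Rotate-half rotary embedding: pair element i with element i + half.
    half = len(v) // 2
    out = list(v)
    for i in range(half):
        theta = pos * base ** (-i / half)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = v[i] * c - v[i + half] * s
        out[i + half] = v[i + half] * c + v[i] * s
    return out

def gelu(v):
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def decode_block(x, pos, w, k_cache, v_cache):
    # In an unfused runtime each numbered step is typically its own kernel,
    # with the intermediate result written to and re-read from off-chip memory.
    h = layer_norm(x)                                         # 1. LayerNorm
    q, k, v = (matvec(w[n], h) for n in ("wq", "wk", "wv"))   # 2. QKV projection
    q, k = rope(q, pos), rope(k, pos)                         # 3. RoPE
    k_cache.append(k)
    v_cache.append(v)
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, kc)) for kc in k_cache]
    top = max(scores)
    p = [math.exp(s - top) for s in scores]                   # 4. decode attention
    z = sum(p)
    ctx = [sum(pi / z * vc[j] for pi, vc in zip(p, v_cache)) for j in range(len(v))]
    attn_out = matvec(w["wo"], ctx)                           # 5. output projection
    h2 = layer_norm(x)                                        # 6. post-attention LayerNorm
    mlp_out = matvec(w["w2"], [gelu(u) for u in matvec(w["w1"], h2)])  # 7. MLP
    # 8. residual (assumed parallel form: input + attention + MLP)
    return [xi + a + m for xi, a, m in zip(x, attn_out, mlp_out)]
```

Every intermediate (`h`, `q`, `ctx`, `attn_out`, `h2`) is a tensor that a per-operator runtime would materialize between kernels; these are the round trips that a wider fusion scope aims to keep on-chip.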

Large language model (LLM) decoding is latency-sensitive and often bottlenecked by fragmented operator execution and repeated off-chip materialization of intermediate tensors. Prior work expands fusion scope by leveraging thread-block clusters and on-chip inter-block collectives to fuse attention-side operators such as QKV projection, attention, and output projection. We develop ClusterFusion++, a CUDA-level extension that broadens fusion to the full Transformer decoder block for GPT-NeoX/Pythia models: LayerNorm -> QKV -> RoPE -> decode attention -> output projection -> Post-LN -> MLP -> residual. We additionally engineer a CUDA-Graph-compatible execution mode with persistent Tensor Memory Accelerator (TMA) descriptors to reduce per-step overhead. On an NVIDIA RTX 5090-class GPU, ClusterFusion++ improves throughput by 1.34x for Pythia-2.8B and yields similar gains for Pythia-6.9B, while maintaining high output fidelity (near-token-identical generation, with minor non-determinism from FP16 atomics).
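The non-determinism attributed to FP16 atomics is consistent with a general property of floating-point arithmetic: addition is not associative, so a sum accumulated with atomic adds can depend on the order in which partial results arrive. The sketch below illustrates only that general effect, emulating FP16 rounding with Python's `struct` half-precision format; it does not model the paper's kernels, and the values are chosen purely for illustration.

```python
import struct

def to_fp16(x):
    # Round a Python float to the nearest IEEE 754 half-precision value.
    return struct.unpack("e", struct.pack("e", x))[0]

def fp16_sum(values):
    # Accumulate in arrival order, rounding to FP16 after every add.
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + to_fp16(v))
    return acc

partials = [2048.0, 1.0, 1.0]
order_a = fp16_sum(partials)            # large term arrives first
order_b = fp16_sum(reversed(partials))  # small terms arrive first
print(order_a, order_b)                 # prints "2048.0 2050.0"
```

At 2048 the spacing between adjacent FP16 values is 2, so each lone `+1` is rounded away, while the two small terms added together first survive. A last-bit difference of this kind can occasionally change which token wins, which fits the abstract's description of generation as near-token-identical rather than bit-identical.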
