AR, LG | Dec 19, 2024

GFormer: Accelerating Large Language Models with Optimized Transformers on Gaudi Processors

arXiv:2412.19829v1 | h-index: 10
AI Analysis

This work addresses optimization challenges for LLM inference on emerging hardware like Gaudi processors, offering incremental improvements in computational efficiency.

The paper addresses the inefficiency of Transformer-based large language models on Gaudi processors by proposing GFormer, an integrated approach that merges sparse and linear attention mechanisms. The authors report improved efficiency and model performance on the Gaudi processor, and results that outperform state-of-the-art GPUs.

Heterogeneous hardware such as the Gaudi processor has been developed to accelerate computation, especially the matrix operations of Transformer-based large language models (LLMs) used for generative AI tasks. However, our analysis indicates that Transformers are not fully optimized on such emerging hardware, primarily because of inadequate optimization of non-matrix computational kernels such as Softmax and of heterogeneous resource utilization, particularly when processing long sequences. To address these issues, we propose an integrated approach, called GFormer, that merges sparse and linear attention mechanisms. GFormer aims to maximize the computational capabilities of the Gaudi processor's Matrix Multiplication Engine (MME) and Tensor Processing Cores (TPC) without compromising model quality. It includes a windowed self-attention kernel and an efficient outer product kernel for causal linear attention, both aimed at optimizing LLM inference on Gaudi processors. Evaluation shows that GFormer significantly improves efficiency and model performance across various tasks on the Gaudi processor and outperforms state-of-the-art GPUs.
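
The abstract names two building blocks, windowed self-attention and causal linear attention computed through outer products, without giving their details. The NumPy sketch below illustrates the general techniques only. It is not the GFormer kernels and does not use any Gaudi API. The function names, the single-head layout, and the elu(x) + 1 feature map are assumptions chosen for illustration, and the abstract does not say how GFormer combines the two mechanisms, so they are shown separately.

```python
import numpy as np

def windowed_causal_attention(q, k, v, window):
    """Softmax attention where position i sees only positions i-window+1 .. i."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    idx = np.arange(n)
    # allowed[i, j] is True when key j lies inside the causal window of query i
    allowed = (idx[None, :] <= idx[:, None]) & (idx[None, :] > idx[:, None] - window)
    scores = np.where(allowed, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v

def feature_map(x):
    """Positive feature map elu(x) + 1 (an assumed choice, not from the abstract)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def causal_linear_attention(q, k, v):
    """Causal linear attention using a running sum of outer products.

    No Softmax and no n-by-n score matrix are needed: each step updates a
    (d, dv) state with the outer product of the mapped key and the value.
    """
    n, d = q.shape
    dv = v.shape[1]
    phi_q, phi_k = feature_map(q), feature_map(k)
    state = np.zeros((d, dv))  # sum over j <= i of outer(phi(k_j), v_j)
    norm = np.zeros(d)         # sum over j <= i of phi(k_j)
    out = np.empty((n, dv))
    for i in range(n):
        state += np.outer(phi_k[i], v[i])
        norm += phi_k[i]
        out[i] = (phi_q[i] @ state) / (phi_q[i] @ norm)
    return out

# Usage on random single-head inputs: 6 tokens, head dimension 4
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 6, 4))
local_out = windowed_causal_attention(q, k, v, window=3)
linear_out = causal_linear_attention(q, k, v)
```

In this sketch the windowed variant still uses Softmax but restricts it to a fixed number of keys per query, while the linear variant replaces Softmax with matrix-style updates. That split is in the spirit of the abstract's goal of using both the MME and the TPC, though the actual mapping of work to those engines is not described here.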

