CL AI NE | Dec 6, 2024

BEExformer: A Fast Inferencing Binarized Transformer with Early Exits

Wazib Ansar, Saptarsi Goswami, Amlan Chakrabarti

arXiv:2412.05225v2 | 1.0 | h-index: 17 | IEEE Trans Sustain Comput

Originality: Incremental advance

AI Analysis

The work addresses the challenge of deploying LLMs on constrained resources and reports a Pareto-optimal performance-efficiency trade-off, though it is incremental in that it builds on existing binarization and early-exit techniques.

The paper tackles the inefficiency of large transformer models by introducing BEExformer, which combines binarization and early exits to shrink the model 21.30-fold and cut inference FLOPs by 52.08% while improving accuracy by 2.89%.

Large Language Models (LLMs) based on transformers achieve cutting-edge results on a variety of applications. However, their enormous size and processing requirements hinder deployment on constrained resources. To enhance efficiency, binarization and Early Exit (EE) have proved to be effective solutions. However, binarization may lead to performance loss, as reduced precision affects gradient estimation and parameter updates. Besides, research on EE mechanisms is still in its early stages. To address these challenges, we introduce the Binarized Early Exit Transformer (BEExformer), the first selective-learning-based transformer to integrate Binarization-Aware Training (BAT) with EE for efficient and fast textual inference. Each transformer block has an integrated Selective-Learn Forget Network (SLFN) to enhance contextual retention while eliminating irrelevant information. The BAT employs a differentiable second-order approximation to the sign function, enabling gradient computation that captures both the sign and magnitude of the weights. This yields a 21.30-fold reduction in model size. The EE mechanism hinges on the fractional reduction in entropy among intermediate transformer blocks, with soft-routing loss estimation. This accelerates inference by reducing FLOPs by 52.08% and even improves accuracy by 2.89% by resolving the "overthinking" problem inherent in deep networks. Extensive evaluation through comparison with SOTA methods and various ablations across six datasets covering multiple NLP tasks demonstrates its Pareto-optimal performance-efficiency trade-off.
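The two mechanisms named in the abstract can be illustrated with a small sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: the sign surrogate is the well-known piecewise quadratic used in earlier binarized-network work (not necessarily the form used in BEExformer), and the exit rule simply stops at the first block whose entropy drops by less than a fixed fraction relative to the previous block. The names approx_sign, approx_sign_grad, entropy, exit_block, and the threshold value are hypothetical.

```python
import math


def approx_sign(x):
    """Piecewise quadratic (second-order) surrogate for sign(x).

    Assumed form for illustration; the paper's exact surrogate may differ.
    """
    if x < -1.0:
        return -1.0
    if x < 0.0:
        return 2.0 * x + x * x
    if x < 1.0:
        return 2.0 * x - x * x
    return 1.0


def approx_sign_grad(x):
    """Derivative of approx_sign.

    Unlike a constant straight-through gradient, this depends on |x|,
    so the backward pass sees the magnitude of the weight as well.
    """
    if -1.0 <= x < 0.0:
        return 2.0 + 2.0 * x
    if 0.0 <= x < 1.0:
        return 2.0 - 2.0 * x
    return 0.0


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def exit_block(per_block_probs, threshold=0.1):
    """Index of the block at which inference would stop.

    per_block_probs: one class distribution per transformer block.
    Stops at the first block whose fractional entropy reduction,
    (H_prev - H_curr) / H_prev, falls below `threshold` (hypothetical rule).
    Falls back to the last block if no block qualifies.
    """
    prev = entropy(per_block_probs[0])
    for i in range(1, len(per_block_probs)):
        curr = entropy(per_block_probs[i])
        # If the previous block was already certain, no reduction is possible.
        reduction = (prev - curr) / prev if prev > 0.0 else 0.0
        if reduction < threshold:
            return i
        prev = curr
    return len(per_block_probs) - 1


# Usage: entropy falls sharply from block 0 to 1, then barely changes.
block_outputs = [[0.5, 0.5], [0.9, 0.1], [0.91, 0.09], [0.99, 0.01]]
stop_at = exit_block(block_outputs, threshold=0.1)
```

In this toy input the stop happens before the final block, which is how an early exit saves the FLOPs of the remaining blocks. A relative criterion of this kind needs no absolute entropy cutoff tuned per task, though whether BEExformer sets its threshold this way is not stated in the abstract.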
