CL | AI | LG | Mar 20, 2025

InhibiDistilbert: Knowledge Distillation for a ReLU and Addition-based Transformer

arXiv:2503.15983v1 | 2 citations | h-index: 5
Originality: Incremental advance
AI Analysis

This work addresses computational and energy efficiency in NLP models. It appears incremental because it builds on the existing DistilBERT architecture and modifies its attention mechanism.

This paper optimizes transformer-based language models by combining model compression with a novel inhibitor attention mechanism, which uses Manhattan distances and ReLU activations instead of matrix multiplications and softmax. In knowledge distillation experiments, the resulting model achieves competitive performance on NLP benchmarks such as GLUE and sentiment analysis tasks.

This work explores optimizing transformer-based language models by integrating model compression techniques with inhibitor attention, a novel alternative attention mechanism. Inhibitor attention employs Manhattan distances and ReLU activations instead of the matrix multiplications and softmax activation of the conventional scaled dot-product attention. This shift offers potential computational and energy savings while maintaining model effectiveness. We propose further adjustments to improve the inhibitor mechanism's training efficiency and evaluate its performance on the DistilBERT architecture. Our knowledge distillation experiments indicate that the modified inhibitor transformer model can achieve competitive performance on standard NLP benchmarks, including General Language Understanding Evaluation (GLUE) and sentiment analysis tasks.
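The abstract names the ingredients of inhibitor attention (Manhattan distances and ReLU) but does not state a formula. The sketch below shows one plausible reading, under the assumption that the score between a query and a key is their scaled Manhattan distance, and that each value entry is reduced by that score and passed through ReLU before summing over keys. The function names, the `gamma` scaling parameter, and the exact form of the aggregation are illustrative assumptions, not the paper's confirmed definition, and the paper's further training adjustments are not modelled here.

```python
def relu(x):
    """Rectified linear unit for a single number."""
    return x if x > 0.0 else 0.0


def manhattan_scores(Q, K, gamma=1.0):
    """Z[i][j] = sum_k |Q[i][k] - K[j][k]| / gamma (assumed score form).

    Uses only subtraction, absolute value, and addition; no dot product.
    """
    return [
        [sum(abs(q - k) for q, k in zip(q_row, k_row)) / gamma for k_row in K]
        for q_row in Q
    ]


def inhibitor_attention(Q, K, V, gamma=1.0):
    """H[i][k] = sum_j relu(V[j][k] - Z[i][j]) (assumed aggregation form).

    A large distance Z[i][j] "inhibits" value row j for query i, taking
    the place of the softmax weighting in scaled dot-product attention.
    """
    Z = manhattan_scores(Q, K, gamma)
    n_queries, n_keys, d_v = len(Q), len(V), len(V[0])
    return [
        [sum(relu(V[j][k] - Z[i][j]) for j in range(n_keys)) for k in range(d_v)]
        for i in range(n_queries)
    ]


# Tiny example: two queries, two keys, two value dimensions.
Q = [[0.0, 0.0], [1.0, 1.0]]
K = [[0.0, 0.0], [2.0, 2.0]]
V = [[1.0, 2.0], [3.0, 0.5]]
H = inhibitor_attention(Q, K, V)
```

In this toy example the first query matches the first key exactly (distance 0), so the first value row passes through unchanged, while the second key lies at distance 4 and its value row is fully suppressed.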

