InhibiDistilbert: Knowledge Distillation for a ReLU and Addition-based Transformer
This work addresses computational and energy efficiency in NLP models, though it appears incremental because it builds on the existing DistilBERT architecture with modifications to the attention mechanism.
This paper optimizes transformer-based language models by combining model compression with a novel inhibitor attention mechanism, which uses Manhattan distances and ReLU activations instead of matrix multiplications and softmax. In knowledge distillation experiments, the resulting model achieves competitive performance on NLP benchmarks such as GLUE and sentiment analysis tasks.
This work explores optimizing transformer-based language models by integrating model compression techniques with inhibitor attention, a novel alternative attention mechanism. Inhibitor attention employs Manhattan distances and ReLU activations instead of the matrix multiplications and softmax activation of the conventional scaled dot-product attention. This shift offers potential computational and energy savings while maintaining model effectiveness. We propose further adjustments to improve the inhibitor mechanism's training efficiency and evaluate its performance on the DistilBERT architecture. Our knowledge distillation experiments indicate that the modified inhibitor transformer model can achieve competitive performance on standard NLP benchmarks, including General Language Understanding Evaluation (GLUE) and sentiment analysis tasks.
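To make the contrast with scaled dot-product attention concrete, the following is a minimal NumPy sketch of one plausible form of inhibitor-style attention. The abstract states only that Manhattan distances and ReLU replace matrix multiplication and softmax; the exact formula, the scaling constant `gamma`, and the function name `inhibitor_attention` are assumptions made for illustration, and the sketch does not include the further adjustments proposed in this work.

```python
import numpy as np


def inhibitor_attention(Q, K, V, gamma=None):
    """Hypothetical single-head inhibitor-style attention (illustrative only).

    Q: (n_q, d) queries, K: (n_k, d) keys, V: (n_k, d_v) values.
    gamma: assumed scaling constant; defaults to sqrt(d) by analogy with
           scaled dot-product attention (an assumption, not from the source).
    """
    d = Q.shape[-1]
    if gamma is None:
        gamma = np.sqrt(d)

    # Attention scores: Manhattan (L1) distance between every query and key,
    # computed with subtraction and absolute value rather than a matrix product.
    # Shape: (n_q, n_k). A small distance means a close match.
    Z = np.abs(Q[:, None, :] - K[None, :, :]).sum(axis=-1) / gamma

    # Inhibition: instead of softmax weighting, subtract each score from the
    # corresponding value vector and clip at zero with ReLU, so that distant
    # keys suppress their values. Then sum over keys. Shape: (n_q, d_v).
    H = np.maximum(V[None, :, :] - Z[:, :, None], 0.0).sum(axis=1)
    return H


# Tiny example: one query, two keys.
Q = np.array([[0.0, 0.0]])
K = np.array([[0.0, 0.0], [1.0, 1.0]])
V = np.array([[1.0, 2.0], [3.0, 4.0]])

# With gamma=1 the distances are [0, 2]. The first value passes unchanged
# ([1, 2]); the second is reduced by 2 ([1, 2]). Their sum is [[2, 4]].
H = inhibitor_attention(Q, K, V, gamma=1.0)
```

In this sketch the score and aggregation steps use only subtraction, absolute value, comparison, and addition, which is the property the abstract points to as the source of potential computational and energy savings.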