Does Self-Attention Need Separate Weights in Transformers?
This work addresses parameter efficiency and training time for NLP models, offering an incremental improvement with potential benefits for noisy and out-of-domain data.
The paper tackles the computational complexity and parameter inefficiency of self-attention in Transformers. It introduces a shared weight self-attention-based BERT model that learns a single weight matrix for the Key, Value, and Query representations, reducing the parameter count of the attention block by 66.53% and improving accuracy on GLUE by up to 5.81% over the baseline attention variants.
The success of self-attention lies in its ability to capture long-range dependencies and enhance context understanding, but it is limited by its computational complexity and by the difficulty of handling sequential data with inherent directionality. This work introduces a shared weight self-attention-based BERT model that learns a single weight matrix for the Key, Value, and Query representations instead of three separate matrices. Our shared weight attention reduces the number of trainable parameters by more than half and the training time by around one-tenth. Furthermore, we demonstrate higher prediction accuracy than the BERT baseline on the smaller GLUE tasks and, in particular, stronger generalization on noisy and out-of-domain data. Experimental results indicate that our shared self-attention method reduces the parameter size of the attention block by 66.53%. On the GLUE benchmark, the shared weight self-attention-based BERT model improves accuracy by 0.38%, 5.81%, and 1.06% over the standard, symmetric, and pairwise attention-based BERT models, respectively. The model and source code are available at Anonymous.
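To make the weight-sharing idea concrete, the following is a minimal sketch rather than the paper's implementation. It assumes the simplest possible reading of the abstract: a single-head attention layer without biases in which one hypothetical matrix `W_s` produces the representation used as Query, Key, and Value. The function names, the toy dimensions, and the NumPy setting are all assumptions made for illustration; the actual model may include further components that the abstract does not describe.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def standard_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product attention with three projections."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V


def shared_attention(X, W_s):
    """Assumed variant: one projection W_s serves as Query, Key, and Value."""
    S = X @ W_s
    scores = S @ S.T / np.sqrt(S.shape[-1])
    return softmax(scores) @ S


# Toy sizes chosen for illustration only (not taken from the paper).
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_s = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out_std = standard_attention(X, W_q, W_k, W_v)
out_shared = shared_attention(X, W_s)

# Projection parameters per attention layer (biases ignored in this sketch).
params_std = 3 * d_model * d_model
params_shared = d_model * d_model
reduction = 1 - params_shared / params_std
```

In this bare form the projection parameters drop from 3d² to d², a reduction of two thirds (about 66.67%), which is close to the 66.53% reported for the attention block; the abstract does not say what accounts for the small difference. Note also that using the same representation for Query and Key makes the score matrix symmetric, whereas the abstract compares against a separate symmetric attention baseline, so the paper's method presumably differs from this sketch in some respect.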