Alternatives to the Scaled Dot Product for Attention in the Transformer Neural Network Architecture
This work addresses a specific bottleneck in transformer architectures that matters to machine learning practitioners. It is incremental, however, because it modifies one existing component and does not change how transformers are applied more broadly.
The paper tackles vanishing gradients in transformer attention by proposing alternatives to the standard scaled dot product, such as dividing by the sum of the key lengths. Simulations with keys and queries indicate that, in many situations, these alternatives appear to be more effective at avoiding regions where applying softmax leads to vanishing gradients.
The transformer neural network architecture uses a form of attention in which the dot product of query and key is divided by the square root of the key dimension before applying softmax. This scaling of the dot product is designed to avoid the absolute value of the dot products becoming so large that applying softmax leads to vanishing gradients. In this paper, we propose some alternative scalings, including dividing the dot product instead by the sum of the key lengths before applying softmax. We use simulated keys and queries to show that in many situations this appears to be more effective at avoiding regions where applying softmax leads to vanishing gradients.
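The two scalings described above can be compared in a small simulation. The following is a minimal sketch in plain Python, not the paper's own experimental code. It assumes that "key length" means the Euclidean norm of each key vector, and it assumes Gaussian keys and queries. The function names (`sqrt_dim_weights`, `key_length_sum_weights`, `max_softmax_gradient`, `simulate`) and the default parameters are hypothetical, chosen only for illustration. Gradient size is summarized by the largest entry of the softmax Jacobian, max_i p_i (1 - p_i), which tends to zero as the softmax saturates.

```python
import math
import random


def dot(u, v):
    """Dot product of two equal-length sequences of floats."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax (subtracts the max before exponentiating)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sqrt_dim_weights(query, keys):
    """Standard scaling: divide each dot product by sqrt(key dimension)."""
    scale = math.sqrt(len(query))
    return softmax([dot(query, k) / scale for k in keys])


def key_length_sum_weights(query, keys):
    """Alternative scaling: divide each dot product by the sum of key lengths.

    ASSUMPTION: "length" is read as the Euclidean norm of each key.
    Requires at least one nonzero key, otherwise the divisor is zero.
    """
    scale = sum(math.sqrt(dot(k, k)) for k in keys)
    return softmax([dot(query, k) / scale for k in keys])


def max_softmax_gradient(weights):
    """Largest-magnitude entry of the softmax Jacobian, max_i p_i * (1 - p_i).

    Values near zero indicate a saturated softmax, i.e. vanishing gradients.
    """
    return max(p * (1.0 - p) for p in weights)


def simulate(dim=64, num_keys=8, sigma=5.0, seed=0):
    """Draw one query and num_keys keys with i.i.d. N(0, sigma^2) entries.

    ASSUMPTION: the Gaussian model and default values are illustrative only.
    """
    rng = random.Random(seed)
    query = [rng.gauss(0.0, sigma) for _ in range(dim)]
    keys = [[rng.gauss(0.0, sigma) for _ in range(dim)] for _ in range(num_keys)]
    return query, keys


if __name__ == "__main__":
    query, keys = simulate()
    std = max_softmax_gradient(sqrt_dim_weights(query, keys))
    alt = max_softmax_gradient(key_length_sum_weights(query, keys))
    print("sqrt(d) scaling:       ", std)
    print("key-length-sum scaling:", alt)
```

Under the Euclidean-norm reading, the Cauchy-Schwarz inequality bounds every scaled score in `key_length_sum_weights` by the norm of the query in absolute value, whereas scores under the square-root scaling grow with the magnitude of the keys as well. Varying `sigma`, `dim`, and `num_keys` in `simulate` gives a rough sense of when each scaling saturates.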