CV, AI, LG · Apr 16, 2025

AttentionDrop: A Novel Regularization Method for Transformer Models

arXiv:2504.12088v2 · h-index: 4
Originality: Highly original
AI Analysis

This paper addresses overfitting in Transformer models for NLP, vision, and speech tasks with a novel regularization approach.

The paper tackles overfitting in Transformer models by proposing AttentionDrop, a family of regularization techniques that operate on self-attention distributions. The authors report improved accuracy, calibration, and adversarial robustness over baselines such as Dropout and R-Drop.

Transformer-based architectures achieve state-of-the-art performance across a wide range of tasks in natural language processing, computer vision, and speech processing. However, their immense capacity often leads to overfitting, especially when training data is limited or noisy. This work proposes AttentionDrop, a unified family of stochastic regularization techniques that operate directly on the self-attention distributions, in three variants. Hard Attention Masking randomly zeroes out top-k attention logits per query to encourage diverse context utilization. Blurred Attention Smoothing applies a dynamic Gaussian convolution over attention logits to diffuse overly peaked distributions. Consistency-Regularized AttentionDrop enforces output stability under multiple independent AttentionDrop perturbations via a KL-based consistency loss. The reported results show that AttentionDrop consistently improves accuracy, calibration, and adversarial robustness over standard Dropout, DropConnect, and R-Drop baselines.
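The three variants can be illustrated with a minimal pure-Python sketch operating on the attention logits of a single query. This is not the authors' implementation. The function names, the drop probability `p`, the fixed `sigma` and `radius` (the paper describes the Gaussian as dynamic), the choice to mask with negative infinity rather than literal zeros, and the symmetric form of the KL loss are all assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hard_attention_mask(logits, k, p, rng):
    """Drop each of the top-k logits independently with probability p.

    Assumption: a dropped logit is set to -inf so its key receives zero
    attention weight after softmax. k must be smaller than the number of
    keys so that at least one finite logit always remains.
    """
    if not 0 <= k < len(logits):
        raise ValueError("k must satisfy 0 <= k < len(logits)")
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    out = list(logits)
    for i in top:
        if rng.random() < p:
            out[i] = float("-inf")
    return out


def blurred_attention_smoothing(logits, sigma, radius=2):
    """Convolve logits along the key axis with a truncated Gaussian kernel.

    Assumption: sigma is fixed here; kernel weights are renormalized at the
    sequence edges so border positions are not biased toward zero.
    """
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-(o * o) / (2.0 * sigma * sigma)) for o in offsets]
    n = len(logits)
    out = []
    for i in range(n):
        num, den = 0.0, 0.0
        for o, w in zip(offsets, kernel):
            j = i + o
            if 0 <= j < n:
                num += w * logits[j]
                den += w
        out.append(num / den)
    return out


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_loss(p, q):
    """Symmetric KL between two distributions (assumed form of the loss)."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [0.0, 0.0, 5.0, 0.0, 0.0]  # one sharply peaked query row

    # Two independent stochastic passes over the same logits.
    pass_a = softmax(hard_attention_mask(logits, k=2, p=0.5, rng=rng))
    pass_b = softmax(hard_attention_mask(logits, k=2, p=0.5, rng=rng))
    blurred = softmax(blurred_attention_smoothing(logits, sigma=1.0))

    print("hard mask, pass A:", [round(w, 3) for w in pass_a])
    print("hard mask, pass B:", [round(w, 3) for w in pass_b])
    print("blurred:          ", [round(w, 3) for w in blurred])
    # Stand-in: the loss is applied to attention rows here; in the paper it
    # is applied to model outputs under independent perturbations.
    print("consistency loss: ", consistency_loss(pass_a, pass_b))
```

In a real model these perturbations would be applied per head and per query inside the attention layer during training only, and the consistency term would be added to the task loss with a weighting coefficient; those details are not given in this summary.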

