LG | Jun 2, 2023

Centered Self-Attention Layers

Meta AI
arXiv:2306.01610v1 | 10 citations | h-index: 63
Originality: Incremental advance
AI Analysis

The paper addresses oversmoothing, a fundamental issue in deep learning architectures that affects both researchers and practitioners. The contribution is incremental because it builds on existing mechanisms.

The paper tackles the oversmoothing problem in transformers and graph neural networks by introducing a correction term to the aggregating operator, which improves performance in weakly supervised segmentation and enables training of very deep architectures.

The self-attention mechanism in transformers and the message-passing mechanism in graph neural networks are repeatedly applied within deep learning architectures. We show that this application inevitably leads to oversmoothing, i.e., to similar representations at the deeper layers for different tokens in transformers and different nodes in graph neural networks. Based on our analysis, we present a correction term to the aggregating operator of these mechanisms. Empirically, this simple term eliminates much of the oversmoothing problem in visual transformers, obtaining performance in weakly supervised segmentation that surpasses elaborate baseline methods that introduce multiple auxiliary networks and training phases. In graph neural networks, the correction term enables the training of very deep architectures more effectively than many recent solutions to the same problem.
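The oversmoothing effect described in the abstract can be illustrated with a small sketch. The abstract does not give the form of the correction term, so the `center` function below is an assumption suggested by the title: it subtracts 1/n from every entry of an n x n attention matrix so that each row sums to zero instead of one. The toy logits, the feature values, and all function names are hypothetical, and reusing one attention matrix at every layer is a simplification.

```python
import math

def softmax_rows(scores):
    """Row-wise softmax: each output row is positive and sums to 1."""
    out = []
    for row in scores:
        m = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def center(attn):
    """ASSUMED correction: subtract 1/n from every entry so each row sums to 0."""
    n = len(attn)
    return [[a - 1.0 / n for a in row] for row in attn]

def aggregate(op, x):
    """Apply an n x n aggregating operator to n x d token features (op @ x)."""
    return [[sum(op[i][k] * x[k][j] for k in range(len(x)))
             for j in range(len(x[0]))]
            for i in range(len(op))]

def token_spread(x):
    """Largest per-coordinate gap between any two tokens (0 = identical tokens)."""
    return max(max(col) - min(col) for col in zip(*x))

# Hypothetical toy attention logits for 3 tokens.
scores = [[2.0, 0.5, -1.0], [0.0, 1.0, 0.3], [-0.5, 0.2, 1.5]]
attn = softmax_rows(scores)
centered = center(attn)

# Hypothetical features: 3 tokens, 2 dimensions each.
x = [[1.0, -2.0], [0.0, 3.0], [4.0, 0.5]]
initial_spread = token_spread(x)

# Reuse the same attention matrix at every layer (a simplification).
smoothed = x
for _ in range(20):
    smoothed = aggregate(attn, smoothed)

print("row sums, plain:   ", [round(sum(r), 6) for r in attn])
print("row sums, centered:", [round(sum(r), 6) for r in centered])
print("token spread before:", initial_spread)
print("token spread after 20 plain layers:", token_spread(smoothed))
```

Because every row of the plain operator is a set of positive weights summing to one, each layer replaces a token with a weighted average of all tokens, and the spread between tokens shrinks with depth. The assumed centered operator removes that shared averaging component from each row. The sketch only shows the row-sum property and the collapse under the plain operator; it does not reproduce the paper's experiments.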

