How Smoothing is N-simplicial Attention?
This work addresses a computational bottleneck in graph-based models for machine learning researchers, but it is incremental as it builds on existing attention paradigms.
The paper tackles the problem of extending attention mechanisms from pairwise to higher-order interactions by introducing N-simplicial attention with cost-effective simplex selection, and finds that it suffers from over-smoothing despite enabling more complex message-passing.
Going from pure Multilayer Perceptron (MLP) to a learnable graph message-passing mechanism at each layer has been foundational to state-of-the-art results, despite the computational trade-off (e.g. GATs or Transformers). To go a step further, in this work, we introduce N-simplicial attention, going from pairwise token similarity to higher-order interactions, and adapt it for Rotary Position Embeddings (RoPE). To help manage the increased complexity, we propose a cost-effective simplex selection enabling the model to focus its computation load onto the more task-sensitive interactions. Beyond these core mechanisms, we study how smoothing N-simplicial attention is by deriving a Lipschitz upper-bound and by demonstrating that by itself it also suffers from over-smoothing, despite opening the attention message-passing to higher-order interactions.