LG · Feb 11

Stochastic Parroting in Temporal Attention -- Regulating the Diagonal Sink

arXiv:2602.10956v1 · 2 citations
Originality: Incremental advance
AI Analysis

The paper addresses a specific bias in temporal attention for spatio-temporal models. The advance is incremental because it builds on prior work on over-squashing.

The paper tackles the diagonal attention sink in temporal attention mechanisms, motivated by the first-token bias that prior work attributes to over-squashing, and reports experiments showing that its proposed regularization methods mitigate the issue.

Spatio-temporal models analyze spatial structures and temporal dynamics, which makes them prone to information degeneration across space and time. Prior literature has demonstrated that over-squashing in causal attention or temporal convolutions creates a bias toward the first tokens. To analyze whether such a bias is present in temporal attention mechanisms, we derive sensitivity bounds on the expected value of the Jacobian of a temporal attention layer. We theoretically show how off-diagonal attention scores depend on the sequence length, and that temporal attention matrices suffer from a diagonal attention sink. We suggest regularization methods, and experimentally demonstrate their effectiveness.
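
As a rough illustration of the two claims above (off-diagonal scores that depend on sequence length, and attention mass concentrating on the diagonal), the following NumPy sketch builds a toy single-head temporal attention matrix and measures both quantities. It is not the paper's derivation or its regularizer: the shared query/key projection, the Gaussian inputs, and the helper names `temporal_attention`, `diagonal_mass`, and `mean_off_diagonal` are all assumptions made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_attention(x, w_q, w_k):
    """Row-stochastic attention matrix over T time steps (no causal mask).

    x: (T, d_model) sequence of features for one spatial location.
    w_q, w_k: (d_model, d_head) projections (hypothetical toy weights).
    """
    d_head = w_q.shape[1]
    scores = (x @ w_q) @ (x @ w_k).T / np.sqrt(d_head)
    return softmax(scores, axis=-1)


def diagonal_mass(attn):
    """Mean attention weight that each time step assigns to itself."""
    return float(np.trace(attn) / attn.shape[0])


def mean_off_diagonal(attn):
    """Mean attention weight over all off-diagonal entries."""
    t = attn.shape[0]
    return float((attn.sum() - np.trace(attn)) / (t * (t - 1)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model = 16
    # Assumption: queries and keys share one projection, so every
    # self-score is a squared norm and therefore non-negative.
    w = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
    for t in (8, 32, 128):
        x = rng.normal(size=(t, d_model))
        attn = temporal_attention(x, w, w)
        print(t, diagonal_mass(attn), mean_off_diagonal(attn), 1.0 / t)
```

Because every row of a softmax attention matrix sums to 1, the mean off-diagonal weight equals (1 - diagonal mass) / (T - 1), so it necessarily shrinks as the sequence length T grows. Comparing the diagonal mass against the uniform baseline 1/T gives a simple, informal indicator of how strongly a given attention matrix favors the diagonal.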

