LG · CL · Mar 10

Exclusive Self Attention

arXiv:2603.09078v16.32 citations · h-index: 19

Predicted impact: top 34% in LG · last 90 days
Originality: Incremental advance

AI Analysis

The paper addresses a bottleneck in Transformer-based language modeling with an incremental but effective modification to self attention.

The paper introduces exclusive self attention (XSA), which constrains attention to exclude information from the token's own position. On language modeling, XSA consistently outperforms standard self attention across model sizes up to 2.7B parameters, with larger gains at longer sequence lengths.

We introduce exclusive self attention (XSA), a simple modification of self attention (SA) that improves the Transformer's sequence modeling performance. The key idea is to constrain attention to capture only information orthogonal to the token's own value vector (thus excluding information of self position), encouraging better context modeling. Evaluated on the standard language modeling task, XSA consistently outperforms SA across model sizes up to 2.7B parameters and shows increasingly larger gains as sequence length grows.
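
The abstract does not give the exact formulation, so the following is only a minimal sketch of one possible reading of "orthogonal to the token's own value vector": compute ordinary causal self attention, then subtract from each position's output its projection onto that position's own value vector. The function names, the single-head NumPy setup, and the `eps` stabilizer are all assumptions made for illustration and may differ from what the paper actually does.

```python
import numpy as np

def causal_self_attention(q, k, v):
    """Standard single-head causal self attention; q, k, v have shape (T, d)."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    # Mask future positions (strictly above the diagonal).
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over each row.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def exclusive_self_attention(q, k, v, eps=1e-8):
    """Hypothetical XSA sketch (an assumption, not the paper's confirmed method):
    remove from each output row its component along that token's own value vector."""
    y = causal_self_attention(q, k, v)
    # Projection coefficient of y_i onto v_i, computed per position.
    coeff = (y * v).sum(axis=-1, keepdims=True) / ((v * v).sum(axis=-1, keepdims=True) + eps)
    return y - coeff * v

rng = np.random.default_rng(0)
T, d = 6, 8
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
out = exclusive_self_attention(q, k, v)

# Largest remaining overlap between each output and its own value vector.
residual = np.abs((out * v).sum(axis=-1)).max()
```

Under this reading, the first position (which can attend only to itself under the causal mask) produces an output close to zero, since its attention output is exactly its own value vector. Whether the paper handles that case differently is not stated in the abstract.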
