LG · Feb 21, 2024

Linear Transformers are Versatile In-Context Learners

Max Vladymyrov, Johannes von Oswald, Mark Sandler, Rong Ge

arXiv:2402.14180v222.735 citations · h-index: 16 · NIPS

Originality: Incremental advance

AI Analysis

This work addresses robust in-context learning on noisy data. It is an incremental advance because it builds on prior findings about linear transformers.

The paper tackles the problem of understanding the in-context learning capabilities of linear transformers, particularly in noisy data scenarios, and demonstrates that they can discover a novel optimization algorithm that matches or surpasses baseline performance.

Recent research has demonstrated that transformers, particularly linear attention models, implicitly execute gradient-descent-like algorithms on data provided in-context during their forward inference step. However, their capability in handling more complex problems remains unexplored. In this paper, we prove that each layer of a linear transformer maintains a weight vector for an implicit linear regression problem and can be interpreted as performing a variant of preconditioned gradient descent. We also investigate the use of linear transformers in a challenging scenario where the training data is corrupted with different levels of noise. Remarkably, we demonstrate that for this problem linear transformers discover an intricate and highly effective optimization algorithm, surpassing or matching in performance many reasonable baselines. We analyze this algorithm and show that it is a novel approach incorporating momentum and adaptive rescaling based on noise levels. Our findings show that even linear transformers possess the surprising ability to discover sophisticated optimization strategies.
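To make the phrase "preconditioned gradient descent" concrete, here is a minimal sketch of that generic technique on a noisy linear regression task, with an optional momentum term. It is not the algorithm the paper reports, and it is not a transformer. The diagonal preconditioner, the step scaling of 0.5, the task sizes, and all function names (`make_task`, `preconditioned_gd`, and so on) are assumptions chosen for illustration.

```python
import random


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def mse(w, xs, ys):
    """Mean squared error of the linear predictor w on (xs, ys)."""
    return sum((dot(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient(w, xs, ys):
    """Gradient of 0.5 * mean squared error with respect to w."""
    n = len(xs)
    g = [0.0] * len(w)
    for x, y in zip(xs, ys):
        residual = dot(w, x) - y
        for j, xj in enumerate(x):
            g[j] += residual * xj / n
    return g


def preconditioned_gd(xs, ys, precond, steps=100, momentum=0.0):
    """Gradient descent where coordinate j of the gradient is scaled by
    precond[j] (a diagonal preconditioner, assumed here for simplicity).
    With momentum > 0 this becomes a heavy-ball update."""
    d = len(xs[0])
    w = [0.0] * d  # start from the zero weight vector
    v = [0.0] * d  # velocity, used only when momentum > 0
    for _ in range(steps):
        g = gradient(w, xs, ys)
        v = [momentum * vj + pj * gj for vj, pj, gj in zip(v, precond, g)]
        w = [wj - vj for wj, vj in zip(w, v)]
    return w


def make_task(n=40, d=3, noise_std=0.1, seed=0):
    """Hypothetical task: y = <w_true, x> + Gaussian noise."""
    rng = random.Random(seed)
    w_true = [rng.gauss(0, 1) for _ in range(d)]
    xs = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    ys = [dot(w_true, x) + rng.gauss(0, noise_std) for x in xs]
    return xs, ys, w_true


xs, ys, w_true = make_task()
# Diagonal preconditioner: half the inverse second moment of each feature.
precond = [0.5 * len(xs) / sum(x[j] ** 2 for x in xs) for j in range(len(w_true))]
w_hat = preconditioned_gd(xs, ys, precond, steps=100)
```

Scaling each coordinate by the inverse of its second moment keeps the update stable without hand-tuning a learning rate. Passing `momentum=0.5` to `preconditioned_gd` switches to the heavy-ball variant on the same data.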
