LG | Apr 2, 2024

What Can Transformer Learn with Varying Depth? Case Studies on Sequence Learning Tasks

arXiv:2404.01601v1 | 22 citations | h-index: 3 | ICML
Originality: Incremental advance
AI Analysis

This work provides incremental insights into transformer architecture design for researchers in sequence learning, by systematically analyzing depth requirements for specific capabilities.

The paper investigates how transformer depth affects performance on sequence learning tasks, showing that one layer excels at memorization, two layers are needed for reasoning and generalization, and three layers may be required for contextual generalization, with numerical experiments supporting these findings.

We study the capabilities of the transformer architecture with varying depth. Specifically, we designed a novel set of sequence learning tasks to systematically evaluate and comprehend how the depth of a transformer affects its ability to perform memorization, reasoning, generalization, and contextual generalization. We show that a transformer with only one attention layer can excel at memorization but falls short in the other tasks. We then show that exhibiting reasoning and generalization ability requires the transformer to have at least two attention layers, while contextual generalization ability may necessitate three attention layers. Additionally, we identify a class of simple operations that a single attention layer can execute, and show that complex tasks can be approached as combinations of these simple operations and can thus be resolved by stacking multiple attention layers. This sheds light on studying more practical and complex tasks beyond our design. Numerical experiments corroborate our theoretical findings.
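To make the idea of "simple operations combined by stacking attention layers" concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's tasks, constructions, or proofs: the one-hot embeddings, the `scale` sharpening factor, the toy `pairs` table, and the helper names `lookup` and `two_hop` are all hypothetical. One attention call performs a key-to-value lookup (a memorization-style operation), and applying it twice chains two lookups.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values, scale=20.0):
    """Single-head dot-product attention for one query vector.

    `scale` is an assumed sharpening factor that makes the softmax
    nearly one-hot, so the layer behaves like a lookup.
    """
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def lookup(token, pairs, vocab_size):
    """One attention layer: attend to the pair whose key matches `token`
    and copy out its value."""
    keys = [one_hot(k, vocab_size) for k, _ in pairs]
    values = [one_hot(v, vocab_size) for _, v in pairs]
    out = attention(one_hot(token, vocab_size), keys, values)
    # Hard decoding to a token id; a real transformer would instead pass
    # the continuous vector on through the residual stream.
    return max(range(vocab_size), key=lambda i: out[i])


def two_hop(token, pairs, vocab_size):
    """Two stacked lookups: the output of the first is the query of the second."""
    return lookup(lookup(token, pairs, vocab_size), pairs, vocab_size)


# Hypothetical in-context table: 0 -> 1, 1 -> 2, 2 -> 3.
pairs = [(0, 1), (1, 2), (2, 3)]
print(lookup(0, pairs, vocab_size=4))   # single lookup
print(two_hop(0, pairs, vocab_size=4))  # chained lookup
```

The sketch only shows that composing the same single-layer operation yields a two-step result; it makes no claim about what a single layer cannot do, which is the subject of the paper's theoretical analysis.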

