LG | ML | Oct 29, 2024

Abrupt Learning in Transformers: A Case Study on Matrix Completion

Pulkit Gopalani, Ekdeep Singh Lubana, Wei Hu

arXiv:2410.22244v1 | 15.0 | 11 citations | h-index: 18 | NIPS

Originality: Incremental advance

AI Analysis

The paper offers insights into training dynamics for researchers studying Transformer optimization, but the advance is incremental because it rests on a single case study.

The paper investigates abrupt learning in Transformers by framing low-rank matrix completion as a masked language modeling task, showing that a BERT model achieves low error with a training loss that plateaus and then suddenly drops to near-optimal values.

Recent analysis on the training dynamics of Transformers has unveiled an interesting characteristic: the training loss plateaus for a significant number of training steps, and then suddenly (and sharply) drops to near-optimal values. To understand this phenomenon in depth, we formulate the low-rank matrix completion problem as a masked language modeling (MLM) task, and show that it is possible to train a BERT model to solve this task to low error. Furthermore, the loss curve shows a plateau early in training followed by a sudden drop to near-optimal values, despite no changes in the training procedure or hyper-parameters. To gain interpretability insights into this sudden drop, we examine the model's predictions, attention heads, and hidden states before and after this transition. Concretely, we observe that (a) the model transitions from simply copying the masked input to accurately predicting the masked entries; (b) the attention heads transition to interpretable patterns relevant to the task; and (c) the embeddings and hidden states encode information relevant to the problem. We also analyze the training dynamics of individual model components to understand the sudden drop in loss.
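To make the MLM framing concrete, here is a minimal sketch of how one training example could be built: a low-rank matrix is flattened into a sequence, a random subset of entries is replaced by a mask value, and the target is the full sequence. The matrix size, rank, masking probability, entry distribution, and the use of 0.0 as the mask value are all illustrative assumptions, not settings confirmed by the abstract, and the helper names are hypothetical. The sketch also scores the "copy the masked input" behavior that the abstract describes before the loss drop.

```python
import random

MASK_VALUE = 0.0  # assumption: stand-in for a [MASK] token; the paper's encoding may differ


def low_rank_matrix(n, rank, rng):
    """Return an n x n matrix X = U V^T, which has rank at most `rank`."""
    U = [[rng.uniform(-1.0, 1.0) for _ in range(rank)] for _ in range(n)]
    V = [[rng.uniform(-1.0, 1.0) for _ in range(rank)] for _ in range(n)]
    return [
        [sum(U[i][k] * V[j][k] for k in range(rank)) for j in range(n)]
        for i in range(n)
    ]


def make_mlm_example(n=7, rank=2, mask_prob=0.3, seed=0):
    """Flatten a low-rank matrix row by row and mask a random subset of entries."""
    rng = random.Random(seed)
    X = low_rank_matrix(n, rank, rng)
    targets = [x for row in X for x in row]          # full matrix as a sequence
    is_masked = [rng.random() < mask_prob for _ in targets]
    inputs = [MASK_VALUE if m else t for t, m in zip(targets, is_masked)]
    return inputs, targets, is_masked


def mse(pred, target, positions):
    """Mean squared error restricted to positions flagged True."""
    idx = [i for i, keep in enumerate(positions) if keep]
    if not idx:
        return 0.0
    return sum((pred[i] - target[i]) ** 2 for i in idx) / len(idx)


inputs, targets, is_masked = make_mlm_example()
observed = [not m for m in is_masked]

# A model that merely copies its input is exact on observed entries
# and wrong on masked ones (unless the true entry equals MASK_VALUE).
copy_prediction = inputs
print("copy baseline, observed entries:", mse(copy_prediction, targets, observed))
print("copy baseline, masked entries:  ", mse(copy_prediction, targets, is_masked))
```

Under this framing, a model stuck at the copy baseline would show zero error on observed positions and a nonzero error on masked ones, which is one way to read the pre-transition behavior in point (a); solving the task means recovering the masked positions from the low-rank structure.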

