LG | AI | ML | Oct 1, 2023

Improving Length-Generalization in Transformers via Task Hinting

arXiv:2310.00726v1 | 15 citations | h-index: 32
Originality: Incremental advance
AI Analysis

This paper addresses a critical limitation of transformers: poor generalization to sequence lengths not seen during training. It matters to researchers and practitioners working on sequence-based tasks because it offers a method for improving performance on unseen lengths. The advance is incremental, since it builds on existing multitask training ideas.

The paper tackles length generalization in transformers on reasoning and arithmetic tasks, using sorting as its main example. It proposes task hinting, in which the model is trained on a simpler auxiliary task alongside the main task. For models trained on sequences of length at most 20, this raises test accuracy on sequences of length 100 from less than 1% to over 92%.

It has been observed in recent years that transformers have problems with length generalization for certain types of reasoning and arithmetic tasks. In particular, the performance of a transformer model trained on a task (say, addition) up to a certain length (e.g., 5-digit numbers) drops sharply when applied to longer instances of the same problem. This work proposes an approach based on task hinting to address length generalization. Our key idea is that while training the model on task-specific data, it is helpful to simultaneously train the model to solve a simpler but related auxiliary task as well. We study the classical sorting problem as a canonical example to evaluate our approach. We design a multitask training framework and show that task hinting significantly improves length generalization. For sorting, we show that it is possible to train models on data consisting of sequences of length at most $20$ and improve the test accuracy on sequences of length $100$ from less than 1% (for standard training) to more than 92% (via task hinting). Our study uncovers several interesting aspects of length generalization. We observe that while several auxiliary tasks may seem natural a priori, their effectiveness in improving length generalization differs dramatically. We further use probing and visualization-based techniques to understand the internal mechanisms by which the model performs the task, and propose a theoretical construction consistent with the observed learning behaviors of the model. Based on our construction, we show that introducing a small number of length-dependent parameters into the training procedure can further boost the performance on unseen lengths. Finally, we also show the efficacy of our task-hinting-based approach beyond sorting, giving hope that these techniques will be applicable in broader contexts.
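The data setup described in the abstract can be sketched as follows: training examples mix the main sorting task with an auxiliary task on sequences of length at most 20, and evaluation uses sorting instances of length 100. This is only an illustrative sketch. The abstract does not name the auxiliary task or the input encoding, so the "next-larger element" auxiliary task, the task tags, the vocabulary size, and the 50/50 mixing ratio are all assumptions, and `sorted` stands in for a trained model.

```python
import random

# Task markers are hypothetical; the abstract does not say how tasks are encoded.
MAIN_TAG = "<sort>"
AUX_TAG = "<succ>"


def sample_sequence(rng, length, vocab_size=1000):
    """Draw `length` distinct integers so every element has a unique rank."""
    return rng.sample(range(vocab_size), length)


def make_sort_example(rng, length):
    """Main task: map an unsorted sequence to its sorted order."""
    xs = sample_sequence(rng, length)
    return {"task": MAIN_TAG, "input": xs, "target": sorted(xs)}


def make_aux_example(rng, length):
    """Assumed auxiliary task: given a sequence and one of its elements,
    return the next-larger element of that sequence."""
    xs = sample_sequence(rng, length)
    ordered = sorted(xs)
    i = rng.randrange(length - 1)  # any rank except the maximum
    return {"task": AUX_TAG, "input": xs + [ordered[i]], "target": [ordered[i + 1]]}


def build_training_mix(rng, n_examples, max_len=20, aux_fraction=0.5):
    """Mix main and auxiliary examples, all with sequence length <= max_len."""
    examples = []
    for _ in range(n_examples):
        length = rng.randint(2, max_len)  # randint bounds are inclusive
        make = make_aux_example if rng.random() < aux_fraction else make_sort_example
        examples.append(make(rng, length))
    return examples


def exact_match_accuracy(predict, examples):
    """Fraction of examples whose predicted output equals the target exactly."""
    correct = sum(predict(ex["input"]) == ex["target"] for ex in examples)
    return correct / len(examples)


rng = random.Random(0)
train_set = build_training_mix(rng, n_examples=1000, max_len=20)
# Evaluate the main task only, at a length never seen in training.
test_set = [make_sort_example(rng, 100) for _ in range(200)]

# `sorted` is a placeholder for a trained model's decoding function.
print(exact_match_accuracy(sorted, test_set))
```

Exact-match accuracy is used here because a sorted output is either fully correct or not. Whether the paper scores sequences this way is not stated in the abstract.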

